2018 Edition

Estimating Financial Risk through Monte Carlo Simulation

Risk analysis is part of every decision we make when faced with uncertainty, ambiguity, and variability. Indeed, even though we have unprecedented access to information, we can't accurately predict the future. In finance, there is a fair amount of uncertainty and risk involved with estimating the future value of financial products, due to the wide variety of potential outcomes. Monte Carlo simulation (also known as the Monte Carlo Method) allows inspecting many possible outcomes of the decision making process, and can be used to assess the impact of risk: this, in turns, allows for better decision-making under uncertainty.

Goals

The main objectives we set for this Notebook are as follows:

  1. Develop fundamental knowledge about Risk analysis
  2. Understand Monte Carlo Simulation (MCS)
  3. Apply Monte Carlo Simulation for predicting risk

Steps

  1. First, in section 1, we introduce the basics of MCS
  2. In section 2, we work on a simple example to where we apply the MCS method
  3. In section 3, we briefly summarize the main characteristics of the Monte Carlo Simulation (MCS) technique
  4. In section 4, we overview the common distributions which are often used in MCS
  5. In section 5, we work on a real use case, that focuses on estimating financial risk. We will use techniques such as featurization (that is, generating additional features to improve model accuracy), linear regression, kernel density estimation, sampling distributions and so on ...

Reference

This Notebook is inspired by Chapter 9 of the book Advanced Analytics with Spark by Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. It is strongly suggested to read this Chapter to get a general idea of the topic of this Notebook.

1. Introduction

1.1. Monte Carlo Simulation (MCS)

Monte Carlo simulation is a computerized mathematical technique that can be applied such that it is possible to account for risk in quantitative analysis and decision making. This technique is used in many different fields, such as R&D, risk management, portfolio management, pricing derivatives, strategic planning, project planning, cost modeling and many more.

In general, MCS is a technique that "converts" uncertainty on input variables of a model into probability distributions. By combining the distributions and randomly selecting values from them, it recalculates the simulated model many times, to determine the probability of the output.

Historically, this technique was first used by scientists working on the atomic bomb: it was named after Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.

1.2. How does it work?

Monte Carlo simulation performs risk analysis by building models of possible results by substituting a range of possible input values, that constitute uncertainty, into a statistical distribution. It then computes possible outcomes repeatedly, each time using a different set of random values from the probability functions that "model" the input. Depending upon the number of random input variables and their distribution, a Monte Carlo simulation could involve thousands or tens of thousands of "rounds" before it is complete. When complete, Monte Carlo simulation produces distributions of possible outcome values.

By using probability distributions instead of actual input samples, it is possible to model more accurately uncertainty: different choices of distributions will yield different outputs.

2. Illustrative example

Imagine you are the marketing manager for a firm that is planning to introduce a new product. You need to estimate the first-year net profit from this product, which might depend on:

  • Sales volume in units
  • Price per unit (also called "Selling price")
  • Unit cost
  • Fixed costs

Net profit will be calculated as $Net Profit = Sales Volume* (Selling Price - Unit cost) - Fixed costs$. Fixed costs (accounting for various overheads, advertising budget, etc.) are known to be \$ 120,000, which we assume to be deterministic. All other factors, instead, involve some uncertainty: sales volume (in units) can cover quite a large range, the selling price per unit will depend on competitor actions, which are hard to predict, and unit costs will also vary depending on vendor prices and production experience, for example.

Now, to build a risk analysis model, we must first identify the uncertain variables -- which are essentially random variables. While there's some uncertainty in almost all variables in a business model, we want to focus on variables where the range of values is significant.

2.1. Unit sales and unit price

Based on a hypothetical market research you have done, you have beliefs that there are equal chances for the market to be slow, normal, or hot:

  • In a "slow" market, you expect to sell 50,000 units at an average selling price of \$11.00 per unit
  • In a "normal" market, you expect to sell 75,000 units, but you'll likely realize a lower average selling price of \$10.00 per unit
  • In a "hot" market, you expect to sell 100,000 units, but this will bring in competitors, who will drive down the average selling price to \$8.00 per unit
In [2]:
#Import Libraries
import random
import matplotlib.pyplot as plt
from pandas.core import datetools
from datetime import datetime
from datetime import timedelta
from itertools import islice
%matplotlib inline
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm
import pylab 
import scipy.stats as stats
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from statsmodels.nonparametric.kde import KDEUnivariate
import matplotlib.pyplot as plt
import scipy
import seaborn as sns
from scipy import stats
import statsmodels.sandbox.distributions.multivariate as mul
from sklearn.mixture import GaussianMixture
import math

Question 1

Calculate the average units and the unit price that you expect to sell, which depend on the market state. Use the assumptions above to compute the expected quantity of products and their expected unit price.
In [3]:
average_unit = (50000+75000+100000)/3
average_price = (11+10+8)/3
print("average unit:", average_unit)
print("average_price:", average_price)
average unit: 75000.0
average_price: 9.666666666666666

2.2. Unit Cost

Another uncertain variable is Unit Cost. In our illustrative example, we assume that your firm's production manager advises you that unit costs may be anywhere from \$5.50 to \$7.50, with a most likely expected cost of \$6.50. In this case, the most likely cost can be considered as the average cost.

2.3. A Flawed Model: using averages to represent our random variables

Our next step is to identify uncertain functions -- also called functions of a random variable. Recall that Net Profit is calculated as $Net Profit = Sales Volume * (Selling Price - Unit cost) - Fixed costs$. However, Sales Volume, Selling Price and Unit Cost are all uncertain variables, so Net Profit is an uncertain function.

The simplest model to predict the Net Profit is using average of sales volume, average of selling price and average of unit cost for calculating. So, if only consider averages, we can say that the $Net Profit = 75,000*(9.66666666 - 6.5) - 120,000 \sim 117,500$.

However, as Dr. Sam Savage warns, "Plans based on average assumptions will be wrong on average." The calculated result is far from the actual value: indeed, the true average Net Profit is roughly \$93,000, as we will see later in the example.

Question 2

Question 2.1

Write a function named `calNetProfit` to calculate the Net Profit using the average of sales volume, the average of selling price and the average of unit cost.
In [4]:
def calNetProfit(average_unit, average_price, average_unitcost, fixed_cost):
    return average_unit*(average_price-average_unitcost)-fixed_cost

average_unitcost = 6.5
fixed_cost = 120000
NetProfit = calNetProfit(average_unit,average_price,average_unitcost,fixed_cost)
print("Net profit:", NetProfit)
Net profit: 117499.99999999994

Question 2.2

Verify the warning message of Dr. Sam Savage by calculating the error of our estimated Net Profit using averages only. Recall that the true value is roughly \$93,000, so we are interested in:
    $$ error = \frac{your\_value - true\_value}{true\_value}$$
      Note also we are interested in displaying the error as a percentage.
        Looking at the error we make, do you think that we can use the current model that only relies on averages?
        In [5]:
        trueNetProfit = 93000
        error = (NetProfit - trueNetProfit) / (trueNetProfit)
        print("Error in percentage:", error * 100)
        
        Error in percentage: 26.344086021505316
        
        We see here that the error is very high. This means we can't depend on averages for estimating the net profit. An average can't represent an uncertain quantity because it ignores the impact of inevitable variations.

        2.4. Using the Monte Carlo Simulation method to improve our model

        As discussed before, the selling price and selling volume both depend on the state of the market scenario (slow/normal/hot). So, the net profit is the result of two random variables: market scenario (which in turn determines sales volumes and selling price) and unit cost.

        Now, let's assume (this is an a-priori assumption we make) that market scenario follows a discrete, uniform distribution and that unit cost also follows a uniform distribution. Then, we can compute directly the values for selling price and selling volumes based on the outcome of the random variable market scenario, as shown in Section 2.1.

        From these a-priori distributions, in each run (or trial) of our Monte Carlo simulation, we can generate the sample value for each random variable and use it to calculate the Net Profit. The more simulation runs, the more accurate our results will be. For example, if we run the simulation 100,000 times, the average net profit will amount to roughly \$92,600. Every time we run the simulation, a different prediction will be output: the average of such predictions will consistently be less than \$117,500, which we predicted using averages only.

        Note also that in this simple example, we generate values for the market scenario and unit cost independently: we consider them to be independent random variables. This means that the eventual (and realistic!) correlation between the market scenario and unit cost variables is ignored. Later, we will learn how to be more precise and account for dependency between random variables.

        Question 3

        Question 3.1

        Write a function named `get_sales_volume_price` that returns the sales volume and price based on the market scenario. In particular, the scenario can get one of three values:
        • 0: Slow market
        • 1: Normal market
        • 2: Hot market
        The return value is a tuple in the form: `(sales_volume, price)`
        In [6]:
        # Get sales volume and  price based on market scenario
        # the function returns a tuple of (sales_volume, price)
        def get_sales_volume_price(scenario):
            # Slow market
            if scenario == 0:
                return (50000,11)
            # Normal market
            elif scenario ==1:
                return (75000,10)
            # Hot market
            else:
                return (100000,8)
        

        Question 3.2

        Run 100,000 Monte Carlo simulations and calculate the average net profit they produce. Then, compare the result to the "average model" we used in the previous questions (the one we called "flawed" model). Put your comments about the discrepancies between a simplistic model, and the more accurate MCS approach.
          Note that in each iteration, the `unit_cost` and `market_scenario` are generated according to their distributions. Also, recall what we have seen in Section 2.2: your firm account manager helped you with some research, to determine the variability of your random variables.
          HINT

          Function uniform(a,b) in module random generates a number $a<=c<=b$, which is drawn from a uniform distribution.

          Function randint(a,b) helps you generating an integer number $a<=c<=b$

          In [8]:
          profit=[]
          num_simulations = [10,1000,10000,100000,1000000]
          for num_simulation in num_simulations:
              total=0.0
              for i in range(0,num_simulation):
                  unit_cost = random.uniform(5.5,7.5)
                  market_scenario = random.randint(0,2)
                  sales_volume, price = get_sales_volume_price(market_scenario)
                  netProfit = calNetProfit(sales_volume, price, unit_cost, fixed_cost)
                  total += netProfit/num_simulation
              profit.append(total)
              print("Number of Simulations = {} ===> Average Net Profit = {}".format(num_simulation, total))
              print("Error percentage: ",100*abs(total-trueNetProfit)/trueNetProfit)
              print("********************************************************************************")
          plt.bar(range(len(profit)),profit, color = "skyblue")
          plt.ylim([70000,130000])
          plt.xticks(range(len(profit)),num_simulations)
          plt.axhline(y=trueNetProfit,linestyle='--')
          plt.xlabel("Number of Simulations")
          plt.ylabel("Average Net Profit")
          plt.show()
          
          Number of Simulations = 10 ===> Average Net Profit = 86345.47500138485
          Error percentage:  7.155403224317362
          ********************************************************************************
          Number of Simulations = 1000 ===> Average Net Profit = 88129.15064973364
          Error percentage:  5.23747241964125
          ********************************************************************************
          Number of Simulations = 10000 ===> Average Net Profit = 91884.52543997097
          Error percentage:  1.199435010783907
          ********************************************************************************
          Number of Simulations = 100000 ===> Average Net Profit = 92287.90302585682
          Error percentage:  0.765695671121695
          ********************************************************************************
          Number of Simulations = 1000000 ===> Average Net Profit = 92416.68403340952
          Error percentage:  0.6272214694521315
          ********************************************************************************
          
          • We see here that modeling using MCS produced a way better result than using the averages.
          • A high number of simulations of uncertain input variables produces an output variable which is very close to the actual value.

          3. A brief summary of the Monte Carlo Simulation (MCS) technique

          • A MCS allows several inputs to be used at the same time to compute the probability distribution of one or more outputs
          • Different types of probability distributions can be assigned to the inputs of the model, depending on any a-priori information that is available. When the distribution is completely unknown, a common technique is to use a distribution computed by finding the best fit to the data you have
          • The MCS method is also called a stochastic method because it uses random variables. Note also that the general assumption is for input random variables to be independent from each other. When this is not the case, there are techniques to account for correlation between random variables.
          • A MCS generates the output as a range instead of a fixed value and shows how likely the output value is to occur in that range. In other words, the model outputs a probability distribution.

          4. Common distributions used in MCS

          In what follows, we summarize the most common probability distributions that are used as a-priori distributions for input random variables:

          • Normal/Gaussian Distribution: this is a continuous distribution applied in situations where the mean and the standard deviation of a given input variable are given, and the mean represents the most probable value of the variable. In other words, values "near" the mean are most likely to occur. This is symmetric distribution, and it is not bounded in its co-domain. It is very often used to describe natural phenomena, such as people’s heights, inflation rates, energy prices, and so on and so forth. An illustration of a normal distribution is given below: normal_distribution

          • Lognormal Distribution: this is a distribution which is appropriate for variables taking values in the range $[0, \infty]$. Values are positively skewed, not symmetric like a normal distribution. Examples of variables described by some lognormal distributions include, for example, real estate property values, stock prices, and oil reserves. An illustration of a lognormal distribution is given below: log_normal_distribution

          • Triangular Distribution: this is a continuous distribution with fixed minimum and maximum values. It is bounded by the minimum and maximum values and can be either symmetrical (the most probable value = mean = median) or asymmetrical. Values around the most likely value (e.g. the mean) are more likely to occur. Variables that could be described by a triangular distribution include, for example, past sales history per unit of time and inventory levels. An illustration of a triangular distribution is given below:

          • Uniform Distribution: this is a continuous distribution bounded by known minimum and maximum values. In contrast to the triangular distribution, the likelihood of occurrence of the values between the minimum and maximum is the same. In other words, all values have an equal chance of occurring, and the distribution is simply characterized by the minimum and maximum values. Examples of variables that can be described by a uniform distribution include manufacturing costs or future sales revenues for a new product. An illustration of the uniform distribution is given below:

          • Exponential Distribution: this is a continuous distribution used to model the time that pass between independent occurrences, provided that the rate of occurrences is known. An example of the exponential distribution is given below:

          • Discrete Distribution : for this kind of distribution, the "user" defines specific values that may occur and the likelihood of each of them. An example might be the results of a lawsuit: 20% chance of positive verdict, 30% change of negative verdict, 40% chance of settlement, and 10% chance of mistrial.

          5. A real use case: estimating the financial risk of a portfolio of stocks

          We hope that by now you have a good understanding about Monte Carlo simulation. Next, we apply this method to a real use case: financial risk estimation.

          Imagine that you are an investor on the stock market. You plan to buy some stocks and you want to estimate the maximum loss you could incur after two weeks of investing. This is the quantity that the financial statistic "Value at Risk" (VaR) seeks to measure. VaR is defined as a measure of investment risk that can be used as a reasonable estimate of the maximum probable loss for a value of an investment portfolio, over a particular time period. A VaR statistic depends on three parameters: a portfolio, a time period, and a confidence level. A VaR of 1 million dollars with a 95% confidence level over two weeks, indicates the belief that the portfolio stands only a 5% chance of losing more than 1 million dollars over two weeks. VaR has seen widespread use across financial services organizations. This statistic plays a vital role in determining how much cash investors must hold to meet the credit ratings that they seek. In addition, it is also used to understand the risk characteristics of large portfolios: it is a good idea to compute the VaR before executing trades, such that it can help take informed decisions about investments.

          Our goal is calculating VaR of two weeks interval with 95% confidence level and the associated VaR confidence interval.

          5.1. Terminology

          In this use case, we will use some terms that might require a proper definition, given the domain. This is what we call the Domain Knowledge.

          • Instrument: A tradable asset, such as a bond, loan, option, or stock investment. At any particular time, an instrument is considered to have a value, which is the price for which it can be sold. In the use case of this notebook, instruments are stock investments.
          • Portfolio: A collection of instruments owned by a financial institution.
          • Return: The change in an instrument or portfolio’s value over a time period.
          • Loss: A negative return.
          • Index: An imaginary portfolio of instruments. For example, the NASDAQ Composite index includes about 3,000 stocks and similar instruments for major US and international companies.
          • Market factor: A value that can be used as an indicator of macro aspects of the financial climate at a particular time. For example, the value of an index, the Gross Domestic Product of the United States, or the exchange rate between the dollar and the euro. We will often refer to market factors as just factors.

          5.2. The context of our use case

          We have a list of instruments that we plan to invest in. The historical data of each instrument has been collected for you. For simplicity, assume that the returns of instruments at a given time, depend on 4 market factors only:

          • GSPC value
          • IXIC value
          • The return of crude oil
          • The return of treasury bonds

          Our goal is building a model to predict the loss after two weeks' time interval with confidence level set to 95%.

          As a side note, it is important to realize that the approach presented in this Notebook is a simplified version of what would happen in a real Financial firm. For example, the returns of instruments at a given time often depend on more than 4 market factors only! Moreover, the choice of what constitute an appropriate market factor is an art!

          5.3. The Data

          The stock data can be downloaded (or scraped) from Yahoo! by making a series of REST calls. The data includes multiple files. Each file contains the historical information of each instrument that we want to invest in. The data is in the following format (with some samples):

          Date, Open, High, Low, Close, Volume, Adj Close
          2016-01-22,66.239998,68.07,65.449997,67.860001,137400,67.860001
          2016-01-21,65.410004,66.18,64.459999,65.050003,148000,65.050003
          2016-01-20,64.279999,66.32,62.77,65.389999,141300,65.389999
          2016-01-19,67.720001,67.989998,64.720001,65.379997,178400,65.379997

          The data of GSPC and IXIC values (our two first market factors) are also available on Yahoo! and use the very same format.

          The crude oil and treasure bonds data is collected from investing.com, and has a different format, as shown below (with some samples):

          Date    Price   Open    High    Low     Vol.    Change %
          Jan 25, 2016    32.17   32.36   32.44   32.10   -       -0.59%
          Jan 24, 2016    32.37   32.10   32.62   31.99   -       0.54%
          Jan 22, 2016    32.19   29.84   32.35   29.53   -       9.01%
          Jan 21, 2016    29.53   28.35   30.25   27.87   694.04K 11.22%
          Jan 20, 2016    26.55   28.33   28.58   26.19   32.11K  -6.71%
          Jan 19, 2016    28.46   29.20   30.21   28.21   188.03K -5.21%

          In our use case, the factors' data will be used jointly to build a statistical model: as a consequence, we first need to preprocess the data to proceed.

          5.4. Data preprocessing

          In this Notebook, all data files have been downloaded for you, such that you can focus on pre-processing. Next, we will:

          • Read the factor data files which are in two different formats, process and merge them together
          • Read the stock data and pre-process it
          • Trim all data into a specific time region
          • Fill in the missing values
          • Generate the data of returns in each two weeks' time interval window

          Factor data pre-processing

          We need two functions to read and parse data from Yahoo! and Investing.com respectively. We are interested only in information about the time and the corresponding returns of a factor or an instrument: as a consequence, we will project away many columns of our RAW data, and keep only the information we are interested in.

          The 3000-instrument and the 4-factor history are small enough to be read and processed locally: we do not need to use the power of parallel computing to proceed. Note that this is true also for larger cases with hundreds of thousands of instruments and thousands of factors. The need for a distributed system like Spark comes in when actually running the Monte Carlo simulations, which can require massive amounts of computation on each instrument.

          Question 4

          Question 4.1

          Write a function named `readInvestingDotComHistory` to parse data from investing.com based on the format specified above (see Section 5.3). Recall that we use two factors here: one that is related to the price of crude oil, one that is related to some specific US bonds.
            Print the first 5 entries of the first factor (crude oil price) in the parsed data.
              Note that we are only interested in the date and price of stocks.

              HINT

              You can parse a string to datetime object by using the function strptime(<string>, <dtime_format>). In this case, the datetime format is "%b %d, %Y". For more information, please follow this link.

              In the next cell, we simply copy data from our HDFS cluster (that contains everything we need for this Notebook) to the instance (a Docker container) running your Notebook. This means that you will have "local" data that you can process without using Spark. Note the folder location: find and verify that you have correctly downloaded the files!

              In [ ]:
              ! [ -d monte-carlo-risk ] || (echo "Downloading prepared data from HDFS. Please wait..." ; hdfs dfs -copyToLocal /datasets/monte-carlo-risk . ; echo "Done!";)
              
              In [8]:
              ls
              
              MC.ipynb*  monte-carlo-risk/
              
              In [9]:
              base_folder = "monte-carlo-risk/"
              
              factors_folder= base_folder + "factors/"
              
              # read data from local disk
              def readInvestingDotComHistory(fname):
                  def process_line(line):
                      cols = line.split("\t")
                      date = datetime.strptime(cols[0], "%b %d, %Y")
                      value = float(cols[1])
                      return (date, value)
                  
                  with open(fname) as f:
                      content_w_header = f.readlines()
                      # remove the first line 
                      # and reverse lines to sort the data by date, in ascending order
                      content = content_w_header[:0:-1]
                      return list(map(process_line , content))
              
              factor1_files = ['crudeoil.tsv', 'us30yeartreasurybonds.tsv']
              factor1_files = map(lambda fn: factors_folder + fn, factor1_files)
              factors1 = [readInvestingDotComHistory(f) for f in factor1_files]
              
              print(factors1[0][:5])
              
              [(datetime.datetime(2006, 1, 26, 0, 0), 66.26), (datetime.datetime(2006, 1, 27, 0, 0), 67.76), (datetime.datetime(2006, 1, 30, 0, 0), 68.35), (datetime.datetime(2006, 1, 31, 0, 0), 67.92), (datetime.datetime(2006, 2, 1, 0, 0), 66.56)]
              
              Let's have a look at these factors.
              In [10]:
              #Let's plot the factors
              
              
              #plot crude oil
              
              dates=list(zip(*factors1[0]))[0]  #get the dates
              stock_values=list(zip(*factors1[0]))[1]    #get the stock values
              plt.figure(figsize=(70,10))
              plt.xticks(fontsize=40)
              plt.yticks(fontsize=40)
              plt.plot(dates,stock_values)
              plt.title("Crude Oil",fontsize=50)
              plt.show()
              
              
              #plot treasury bond
              dates=list(zip(*factors1[1]))[0]  #get the dates
              stock_values=list(zip(*factors1[1]))[1]    #get the stock values 
              plt.figure(figsize=(70,10))
              plt.xticks(fontsize="40")
              plt.yticks(fontsize="40")
              plt.plot(dates,stock_values)
              plt.title("Treasury Bonds",fontsize=50)
              plt.show()
              
              <matplotlib.figure.Figure at 0x7fb8ff105f60>
              Lets' plot them on the same graph to get a better view. We must:
              • unite the dates
              • normalize the values
              Let's find a common date for both.
              Crude oil data have a starting date in 2006 while treasury bonds have a starting date in 2009.

              Let's find the index where the date of crude oil values is equal to the forst dates of treasury bond.
              In [11]:
              #get dates of CO and TB
              dates_CO=list(zip(*factors1[0]))[0]  
              dates_TB=list(zip(*factors1[1]))[0] 
              
              In [12]:
              #find index where date of CO equals first date of TB
              ind_CO=[i for i in range(len(dates_CO)) if dates_CO[i]==dates_TB[0]][0]
              ind_CO
              
              Out[12]:
              513
              In [14]:
              dates_CO=dates_CO[ind_CO:] #unite the start date of both factors
              
              In [11]:
              #function to normalize data
              def normalise(x):
                  if(len(x)>1):
                      x=np.array(x)
                      x=(x-min(x))/(max(x)-min(x))
                  return x
              
              In [16]:
              #Let's plot both
              
              #get the stock values
              stock_values_CO=normalise(list(zip(*factors1[0]))[1][ind_CO:]) 
              stock_values_TB=normalise(list(zip(*factors1[1]))[1])
              
              #plotting
              plt.figure(figsize=(160,80))
              plt.xticks(fontsize=80)
              plt.yticks(fontsize=80)
              plt.plot(dates_CO,stock_values_CO,linewidth=5)
              plt.plot(dates_TB,stock_values_TB,linewidth=5)
              plt.title("Crude Oil and Treasury Bonds (normalised)",fontsize=100)
              plt.show()
              
              • We can see that in general, the values of both stocks are decreasing. However,both have had their ups and downs through out the years.
              • We can see that both decreased a lot in 2009 or a little bit before that. This is probably due to the financial crisis that hit the world back then.

              Now, the data structure factors1 is a list, containing data that pertains to two (out of a total of four) factors that influence the market, as obtained by investing.com. Each element in the list is a tuple, containing some sort of timestamp, and the value of one of the two factors discussed above. From now on, we call these elements "records" or "entries". Visually, factors1 looks like this:

              0 (crude oil) 1 (US bonds)
              time_stamp, value time_stamp, value
              ... ...
              time_stamp, value time_stamp, value
              ... ...

              Question 4.2

              Write a function named `readYahooHistory` to parse data from yahoo.com based on its format, as described in Section 5.3.
                Print the first 5 entries of the first factor (namely GSPC). Comment the time range of the second batch of data we use in our Notebook.
                  Note that we are only interested in the date and price of stocks.

                  NOTE
                  The datetime format now is in a different format than the previous one.

                  HINT
                  Use a terminal (or put the bash commands inline in your Notebook) to list filenames in your local working directory to find and have a look at your local files.

                  In [10]:
                  # read data from local disk
                  def readYahooHistory(fname):
                      def process_line(line):
                          cols=line.split(",")
                          date=datetime.strptime(cols[0],"%Y-%m-%d")
                          value=float(cols[6])
                          return(date,value)
                      
                      with open(fname) as f:
                          lines_w_header=f.readlines()
                          lines=lines_w_header[:0:-1]
                          return(list(map(process_line,lines)))
                          #...
                      
                  
                  factor2_files = ["GSPC.csv", "IXIC.csv"]
                  factor2_files = map(lambda fn: factors_folder + fn, factor2_files)
                  
                  factors2 = [readYahooHistory(f) for f in factor2_files]
                  
                  print(factors2[0][:5])
                  print("\n")
                  print(factors2[-1][-5:])
                  
                  [(datetime.datetime(1950, 1, 3, 0, 0), 16.66), (datetime.datetime(1950, 1, 4, 0, 0), 16.85), (datetime.datetime(1950, 1, 5, 0, 0), 16.93), (datetime.datetime(1950, 1, 6, 0, 0), 16.98), (datetime.datetime(1950, 1, 9, 0, 0), 17.08)]
                  
                  
                  [(datetime.datetime(2016, 1, 15, 0, 0), 4488.419922), (datetime.datetime(2016, 1, 19, 0, 0), 4476.950195), (datetime.datetime(2016, 1, 20, 0, 0), 4471.689941), (datetime.datetime(2016, 1, 21, 0, 0), 4472.060059), (datetime.datetime(2016, 1, 22, 0, 0), 4591.180176)]
                  
                  In [18]:
                  #Let's plot the factors
                  
                  
                  
                  
                  #plot crude oil
                  dates=list(zip(*factors2[0]))[0]  #get the dates
                  stock_values=list(zip(*factors2[0]))[1]    #get the stock values
                  plt.figure(figsize=(70,10))
                  plt.xticks(fontsize=40)
                  plt.yticks(fontsize=40)
                  plt.plot(dates,stock_values)
                  plt.title("GSPC",fontsize=50)
                  plt.show()
                  
                  
                  #plot treasury bond
                  dates=list(zip(*factors2[1]))[0]  #get the dates
                  stock_values=list(zip(*factors2[1]))[1]    #get the stock values 
                  plt.figure(figsize=(70,10))
                  plt.xticks(fontsize=40)
                  plt.yticks(fontsize=40)
                  plt.plot(dates,stock_values)
                  plt.title("IXIC",fontsize=50)
                  plt.show()
                  
                  We can see that there is a vast difference in the dates of the two datasets. The data from Investing.com contains dates from 2006 while the one from Yahoo.com contains dates from 1950.
                  Let's try to visualise all the factors together!
                  We will plot with the dates that we worked on with Crude Oil and Treasury Bonds.
                  In [19]:
                  #get dates for GSPC and IXIC
                  dates_GSPC=list(zip(*factors2[0]))[0] 
                  dates_IXIC=list(zip(*factors2[1]))[0] 
                  
                  In [20]:
                  #find index where date of GSPC equals first date of TB
                  ind_GSPC=[i for i in range(len(dates_GSPC)) if dates_GSPC[i]==dates_TB[0]][0]
                  ind_GSPC
                  
                  Out[20]:
                  14620
                  In [22]:
                  #find index where date of IXIC equals first date of TB
                  ind_IXIC=[i for i in range(len(dates_IXIC)) if dates_IXIC[i]==dates_TB[0]][0]
                  ind_IXIC
                  
                  Out[22]:
                  9339
                  In [23]:
                  dates_GSPC=dates_GSPC[ind_GSPC:] #unite start date
                  dates_IXIC=dates_IXIC[ind_IXIC:] #unite start date
                  stock_values_GSPC=normalise(list(zip(*factors2[0]))[1][ind_GSPC:]) #get GSPC stock values
                  stock_values_IXIC=normalise(list(zip(*factors2[1]))[1][ind_IXIC:]) #get IXIC stock values
                  
                  In [24]:
                  plt.figure(figsize=(160,80))
                  plt.xticks(fontsize=80)
                  plt.yticks(fontsize=80)
                  plt.plot(dates_CO,stock_values_CO,linewidth=5,label="Crude Oil")
                  plt.plot(dates_TB,stock_values_TB,linewidth=5,label="Treasury Bonds")
                  plt.plot(dates_GSPC,stock_values_GSPC,linewidth=5,label="GSPC")
                  plt.plot(dates_IXIC,stock_values_IXIC,linewidth=5,label="IXIC")
                  plt.title("Market Factors (normalised)",fontsize=100)
                  plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.,fontsize=130)
                  plt.show()
                  
                  We can see that,cin a general sense, Crude Oil and Treasury Bonds tend to decrease with time, while GSPC and IXIC tend to increase. However, all market factors decreased in 2009.

                  Now, the data structure factors2 is again list, containing data that pertains to the next two (out of a total of four) factors that influence the market, as obtained by Yahoo!. Each element in the list is a tuple, containing some sort of timestamp, and the value of one of the two factors discussed above. Visually, factors2 looks like this:

                  0 (GSPC) 1 (IXIC)
                  time_stamp, value time_stamp, value
                  ... ...
                  time_stamp, value time_stamp, value
                  ... ...

                  Stock data pre-processing

                  Next, we prepare the data for the instruments we consider in this Notebook (i.e., the stocks we want to invest in).

                  Question 4.3

                  In this Notebook, we assume that we want to invest on the first 35 stocks out of the total 3000 stocks present in our datasets.
                    Load and prepare all the data for the considered instruments (the first 35 stocks) which have historical information for more than 5 years. This means that all instruments with less than 5 years of history should be removed.

                    HINT
                    we suggest to open a terminal window (not on your local machine, but the Notebook terminal that you can find on the Jupyter dashboard) and visually check the contents of the directories holding our dataset, if you didn't do this before! Have a look at how stock data is organized!

                    In [25]:
                    #number of all given stocks
                    ls -al monte-carlo-risk/stocks/ | wc -l
                    
                    2713
                    
                    In [11]:
                    from os import listdir
                    stock_folder = base_folder + 'stocks'
                    from os.path import isfile, join
                    
                    In [12]:
                    stock_names=[w[:-4] for w in listdir(stock_folder)] #list containing stock names
                    
                    In [13]:
                    def process_stock_file(fname):
                        try:
                            return readYahooHistory(fname)
                        except Exception as e:
                            raise e
                            return None
                    
                    
                    
                    # select path of all stock data files in "stock_folder"
                    files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
                    
                    # assume that we invest only the first 35 stocks (for faster computation)
                    files = files[:35]
                    
                    # read each line in each file, convert it into the format: (date, value)
                    rawStocks = [process_stock_file(f) for f in files]
                    
                    # select only instruments which have more than 5 years of history
                    # Note: the number of business days in a year is 260
                    
                    number_of_years = 5
                    
                    stock_names_small=[stock_names[i] for i in range(len(files)) if(len(rawStocks[i])>260*5) ]
                    
                    rawStocks = list(filter(lambda instrument:len(instrument)>260*5 , rawStocks))
                    
                    # For testing, print the first 5 entry of the first stock
                    print(rawStocks[0][:5])
                        
                        
                    
                        
                        
                    
                    [(datetime.datetime(1997, 8, 14, 0, 0), 6.011436), (datetime.datetime(1997, 8, 15, 0, 0), 6.473854), (datetime.datetime(1997, 8, 18, 0, 0), 7.47576), (datetime.datetime(1997, 8, 19, 0, 0), 7.39869), (datetime.datetime(1997, 8, 20, 0, 0), 7.39869)]
                    
                    In [14]:
                    len(rawStocks)
                    
                    Out[14]:
                    29
                    We can see here that the number of stocks decreased from 35 stocks to 29 stocks.
                    In [31]:
                    #Let's plot the stocks
                    
                    
                    plt.figure(figsize=(70,140))
                    for i in range(len(rawStocks)):
                        plt.subplot(10,3,i+1)
                        dates=list(zip(*rawStocks[i]))[0]  #get the dates
                        stock_values=list(zip(*rawStocks[i]))[1]    #get the stock values
                        plt.xticks(fontsize=30)
                        plt.yticks(fontsize=30)
                        plt.plot(dates,stock_values)
                        plt.title("{}".format(stock_names_small[i]),fontsize=40)
                    

                    Time alignment for our data

                    Different types of instruments may trade on different days, or the data may have missing values for other reasons, so it is important to make sure that our different histories align. First, we need to trim all of our time series to the same region in time. Then, we need to fill in missing values. To deal with time series that have missing values at the start and end dates in the time region, we simply fill in those dates with nearby values in the time region.

                    Question 4.4

                    Assume that we only focus on the data from 23/01/2009 to 23/01/2014. Write a function named `trimToRegion` to select only the records in that time interval.
                      **Requirements**: after processing, each instrument $i$ has a list of records: $[r_0, r_2,...,r_{m_i}]$ such that $r_0$ and $r_{m_i}$ are assigned, respectively, the first and the last values corresponding to the extremes of the given time interval. For example: $r_0$ should contain the value at date 23/01/2009.
                      In [15]:
                      # note that the data of crude oil and treasury is only available starting from 26/01/2006 
                      start = datetime(year=2009, month=1, day=23)
                      end = datetime(year=2014, month=1, day=23)
                      
                      def trimToRegion(history, start, end):
                          def isInTimeRegion(entry):
                              (date, value) = entry
                              return date >= start and date <= end
                      
                          # only select entries which are in the time region
                          trimmed = list(filter( isInTimeRegion, history))
                          
                          # if the data has incorrect time boundaries, add time boundaries
                          if trimmed[0][0] != start:
                              trimmed.insert(0, (start, trimmed[0][1]))
                          if trimmed[-1][0] != end:
                              trimmed.append((end, trimmed[-1][1]))
                          return trimmed
                          
                      # test our function
                      trimmedStock0  = trimToRegion(rawStocks[0], start, end)
                      # the first 5 records of stock 0
                      print(trimmedStock0[:5])
                      
                      print("\n")
                      # the last 5 records of stock 0
                      print(trimmedStock0[-5:])
                      
                      assert(trimmedStock0[0][0] == start), "the first record must contain the price in the first day of time interval"
                      assert(trimmedStock0[-1][0] == end), "the last record must contain the price in the last day of time interval"
                      
                      [(datetime.datetime(2009, 1, 23, 0, 0), 16.254108), (datetime.datetime(2009, 1, 26, 0, 0), 16.470275), (datetime.datetime(2009, 1, 27, 0, 0), 16.703071), (datetime.datetime(2009, 1, 28, 0, 0), 17.975132), (datetime.datetime(2009, 1, 29, 0, 0), 16.478589)]
                      
                      
                      [(datetime.datetime(2014, 1, 16, 0, 0), 35.571935), (datetime.datetime(2014, 1, 17, 0, 0), 35.552912), (datetime.datetime(2014, 1, 21, 0, 0), 35.971404), (datetime.datetime(2014, 1, 22, 0, 0), 36.019202), (datetime.datetime(2014, 1, 23, 0, 0), 35.149312)]
                      

                      Dealing with missing values

                      We expect that we have the price of instruments and factors in each business day. Unfortunately, there are many missing values in our data: this means that we miss data for some days, e.g. we have data for the Monday of a certain week, but not for the subsequent Tuesday. So, we need a function that helps filling these missing values.

                      Next, we provide to you the function to fill missing value: read it carefully!

                      In [16]:
                      def fillInHistory(history, start, end):
                          curr = history
                          filled = []
                          idx = 0
                          curDate = start
                          numEntries = len(history)
                          while curDate < end:
                              
                              # if the next entry is in the same day
                              # or the next entry is at the weekend
                              # but the curDate has already skipped it and moved to the next monday
                              # (only in that case, curr[idx + 1][0] < curDate )
                              # then move to the next entry
                              while idx + 1 < numEntries and curr[idx + 1][0] == curDate:
                                  idx +=1
                      
                              # only add the last value of instrument in a single day
                              # check curDate is weekday or not
                              # 0: Monday -> 5: Saturday, 6: Sunday
                              if curDate.weekday() < 5:
                                  
                                  filled.append((curDate, curr[idx][1]))
                                  # move to the next business day
                                  curDate += timedelta(days=1)
                              
                              # skip the weekends
                              #it is better here to be an if statement than an elif
                              #now we quickly skip Sundays and Sundays without having to wait for the next iteration
                              if curDate.weekday() >= 5:
                                  # if curDate is Sat, skip 2 days, otherwise, skip 1 day
                                  curDate += timedelta(days=(7-curDate.weekday()))
                      
                          return filled
                      

                      Question 4.5

                      Trim data of stocks and factors into the given time interval.
                      In [17]:
                      # trim into a specific time region
                      # and fill up the missing values
                      stocks = list(map(lambda stock: \
                                  fillInHistory(
                                      trimToRegion(stock, start, end), 
                                  start, end), 
                              rawStocks))
                      
                      
                      
                      # merge two factors, trim each factor into a time region
                      # and fill up the missing values
                      allfactors = factors1 + factors2
                      factors = list(map( lambda factor: \
                                  fillInHistory(
                                      trimToRegion(factor, start,end),
                                  start,end), 
                                  allfactors
                                  ))
                                  
                      # test our code
                      print("the first 5 records of stock 0:", stocks[0][:5], "\n")
                      print("the last 5 records of stock 0:", stocks[0][-5:], "\n")
                      print("the first 5 records of factor 0:", factors[0][:5], "\n")
                      print("the first 5 records of factor 0:", factors[0][-5:], "\n")
                      
                      the first 5 records of stock 0: [(datetime.datetime(2009, 1, 23, 0, 0), 16.254108), (datetime.datetime(2009, 1, 26, 0, 0), 16.470275), (datetime.datetime(2009, 1, 27, 0, 0), 16.703071), (datetime.datetime(2009, 1, 28, 0, 0), 17.975132), (datetime.datetime(2009, 1, 29, 0, 0), 16.478589)] 
                      
                      the last 5 records of stock 0: [(datetime.datetime(2014, 1, 16, 0, 0), 35.571935), (datetime.datetime(2014, 1, 17, 0, 0), 35.552912), (datetime.datetime(2014, 1, 20, 0, 0), 35.552912), (datetime.datetime(2014, 1, 21, 0, 0), 35.971404), (datetime.datetime(2014, 1, 22, 0, 0), 36.019202)] 
                      
                      the first 5 records of factor 0: [(datetime.datetime(2009, 1, 23, 0, 0), 46.47), (datetime.datetime(2009, 1, 26, 0, 0), 45.73), (datetime.datetime(2009, 1, 27, 0, 0), 41.58), (datetime.datetime(2009, 1, 28, 0, 0), 42.16), (datetime.datetime(2009, 1, 29, 0, 0), 41.44)] 
                      
                      the first 5 records of factor 0: [(datetime.datetime(2014, 1, 16, 0, 0), 93.96), (datetime.datetime(2014, 1, 17, 0, 0), 94.37), (datetime.datetime(2014, 1, 20, 0, 0), 93.93), (datetime.datetime(2014, 1, 21, 0, 0), 94.99), (datetime.datetime(2014, 1, 22, 0, 0), 96.73)] 
                      
                      
                      Some stock and factor values increase and some decrease generally. It would be a good idea to plot a frequency distribution and try to fit a known distribution. We will do that when we calculate the returns over a period of two weeks later in this notebook.

                      Recall that Value at Risk (VaR) deals with losses over a particular time horizon. We are not concerned with the absolute prices of instruments, but how those prices change over a given period of time. In our project, we will set that length to two weeks: we use the sliding window method to transform time series of prices into an overlapping sequence of price change over two-week intervals.

                      The figure below illustrates this process. The returns of market factors after each two-week interval is calculated in the very same way.

                      In [18]:
                      def buildWindow(seq, k=2):
                          "Returns a sliding window (of width k) over data from iterable data structures"
                          "   s -> (s0,s1,...s[k-1]), (s1,s2,...,sk), ...                   "
                          it = iter(seq)
                          result = tuple(islice(it, k))
                          if len(result) == k:
                              yield result  
                          for elem in it:
                              result = result[1:] + (elem,)
                              yield result
                      

                      Question 4.6

                      Compute the returns of the stocks after each two-week time window.
                      In [19]:
                      def calculateReturn(window):
                          # return the change of value after two weeks
                          return window[-1][1] - window[0][1]
                      
                      def twoWeekReturns(history):
                          # we use 10 instead of 14 to define the window
                          # because financial data does not include weekends
                          return [calculateReturn(entry) for entry in buildWindow(history, 10)]
                      
                      stocksReturns = list(map(twoWeekReturns, stocks))
                      factorsReturns = list(map(twoWeekReturns, factors))
                      
                      # test our functions
                      print("the first 5 returns of stock 0:", stocksReturns[0][:5])
                      print("the last 5 returns of stock 0:", stocksReturns[0][-5:])
                      print("the first 5 returns of factor 0:", factorsReturns[0][:5])
                      print("the last 5 returns of factor 0:", factorsReturns[0][-5:])
                      
                      the first 5 returns of stock 0: [0.9062390000000029, 0.7233289999999997, 0.24942300000000017, -1.9205629999999978, 0.11639800000000022]
                      the last 5 returns of stock 0: [-1.2459659999999957, -0.6562720000000013, -0.7608960000000025, 0.2663129999999967, 0.06682099999999735]
                      the first 5 returns of factor 0: [-5.299999999999997, -5.559999999999995, -2.019999999999996, -4.609999999999999, -5.5]
                      the last 5 returns of factor 0: [0.0, 0.9399999999999977, 0.2600000000000051, 2.6599999999999966, 5.070000000000007]
                      
                      Let's plot the stocks' returns.
                      In [110]:
                      #Let's plot the stocks returns
                      
                      
                      plt.figure(figsize=(70,140))
                      for i in range(len(stocksReturns)):
                          plt.subplot(10,3,i+1)
                          plt.xticks(fontsize=30)
                          plt.yticks(fontsize=30)
                          plt.plot(stocksReturns[i])
                          plt.title("{}".format(stock_names_small[i]),fontsize=40)
                      
                      Let's plot their frequency distributions and try to fit a normal distribution to them.
                      In [117]:
                      plt.figure(figsize=(70,140))
                      for i in range(len(stocksReturns)):
                          plt.subplot(10,3,i+1)
                          bins=30 #define number of bins for histogram
                          #Plot the histogram
                          plt.hist(stocksReturns[i],bins, normed=1, facecolor='green', alpha=0.75)
                      
                          # Plot the Normal Distribution
                          (mu, std) = norm.fit(stocksReturns[i])
                          xmin, xmax = plt.xlim()
                          x = np.linspace(xmin, xmax, 100)
                          p = norm.pdf(x, mu, std)
                          plt.plot(x, p, 'k', linewidth=6,color='r',linestyle='--',label="Fitted Normal Distribution")
                          plt.legend(prop={'size': 30})
                          plt.xticks(fontsize=30)
                          plt.yticks(fontsize=30)
                          plt.title("{}".format(stock_names_small[i]),fontsize=40)
                      
                      We can see that stock returns remind us of normal distributions. There exists a value which is predominant and other values that occur quite less. However,the distributions might not be 100% normal,instead some have fat tails.
                      To assess that,let's draw a Q-Q plot for each stock.
                      To simply put it, a Q-Q plot compares the similarity of two distributions by plotting their quantiles.
                      In [54]:
                      plt.figure(figsize=(70,140))
                      for i in range(len(stocksReturns)):
                          plt.subplot(10,3,i+1)
                          plt.xticks(fontsize=30)
                          plt.yticks(fontsize=30)
                          stats.probplot(stocksReturns[i], dist="norm", plot=pylab)
                          plt.title("Q-Q Plot of {}".format(stock_names_small[i]),fontsize=40)
                      
                      We can see that the normal distribution is good for modeling stock returns except at the tails of the distribution.

                      Alright! Now we have data that is properly aligned to start the training process: stocks' returns and factors' returns, per time windows of two weeks. Next, we will apply the MCS method.

                      5.5. Summary guidelines to apply the MCS method on the data we prepared

                      Next, we overview the steps that you have to follow to build a model of your data, and then use Monte Carlo simulations to produce output distributions:

                      • Step 1: Defining the relationship between the market factors and the instrument's returns. This relationship takes the form of a model fitted to historical data.
                      • Step 2: Defining the distributions for the market conditions (particularly, the returns of factors) that are straightforward to sample from. These distributions are fitted to historical data.
                      • Step 3: Generate the data for each trial of a Monte Carlo run: this amount to generating the random values for market conditions along with these distributions.
                      • Step 4: For each trial, from the above values of market conditions, and using the relationship built in step 1, we calculate the return for each instrument and the total return. We use the returns to define an empirical distribution over losses. This means that, if we run 100 trials and want to estimate the 5% VaR, we would choose it as the loss from the trial with the fifth greatest loss.
                      • Step 5: Evaluating the result

                      5.6. Applying MCS

                      Step 1: Defining relationship between market factors and instrument's returns

                      In our simulation, we will use a simple linear model. By our definition of return, a factor return is a change in the value of a market factor over a particular time period, e.g. if the value of the S&P 500 moves from 2000 to 2100 over a time interval, its return would be 100.

                      A vector that contains the return of 4 market factors is called a market factor vector. Generally, instead of using this vector as features, we derive a set of features from simple transformation of it. In particular, a vector of 4 values is transformed into a vector of length $m$ by function $F$. In the simplest case $F(v) = v$.

                      Denote $v_t$ the market factor vector, and $f_t$ the transformed features of $v_t$ at time $t$.

                      $f_{tj}$ is the value of feature $j$ in $f_t$.

                      Denote $r_{it}$ the return of instrument $i$ at time $t$ and $c_i$ the intercept term of instrument $i$.

                      We will use a simple linear function to calculate $r_{it}$ from $f_t$:

                      $$ r_{it} = c_i + \sum_{j=1}^{m}{w_{ij}*f_{tj}} $$

                      where $w_{ij}$ is the weight of feature $j$ for instrument $i$.

                      All that above means that given a market factor vector, we have to apply featurization and then use the result as a surrogate for calculating the return of the instruments, using the above linear function.

                      There are two questions that we should consider: how we apply featurization to a factor vector? and how to pick values for $w_{ij}$?

                      How we apply featurization to a factor vector? In fact, the instruments' returns may be non-linear functions of the factor returns. So, we should not use factor returns as features in the above linear function. Instead, we transform them into a set of features with different size. In this Notebook, we can include some additional features in our model that we derive from non-linear transformations of the factor returns. We will try adding two more features for each factor return: its square and its square root values. So, we can still assume that our model is a linear model in the sense that the response variable is a linear function of the new features. Note that the particular feature transformation described here is meant to be an illustrative example of some of the options that are available: it shouldn't be considered as the state of the art in predictive financial modeling!!.

                      How to pick values for $w_{ij}$?

                      For all the market factor vectors in our historical data, we transform them to feature vectors. Now, we have feature vectors in many two-week intervals and the corresponding instrument's returns in these intervals. We can use Ordinary Least Square (OLS) regression model to estimate the weights for each instrument such that our linear function can fit to the data. The parameters for OLS function are:

                      • x: The collection of columns where each column is the value of a feature in many two-week interval
                      • y: The return of an instrument in the corresponding time interval of x.

                      The figure below shows the basic idea of the process to build a statistical model for predicting the returns of stock X.

                      Question 5

                      Question 5.1

                      Currently, our data is in form of: $$ factorsReturns= \begin{bmatrix} r_{00} & r_{01} & r_{02} & ... & r_{0k} \\ r_{10} & r_{11} & r_{12} & ... & r_{1k} \\ ... & ... & ... & ... & ... \\ r_{n0} & r_{n1} & r_{n2} & ... & r_{nk}\\ \end{bmatrix} $$
                        $$ stocksReturns= \begin{bmatrix} s_{00} & s_{01} & s_{02} & ... & s_{0k} \\ s_{10} & s_{11} & s_{12} & ... & s_{1k} \\ ... & ... & ... & ... & ... \\ s_{n0} & s_{n1} & s_{n2} & ... & s_{nk}\\ \end{bmatrix} $$
                          Where, $r_{ij}$ is the return of factor $i^{th}$ in time window $j^{th}$, $k$ is the number of time windows, and $n$ is the number of factors. A similar definition goes for $s_{ij}$.
                            In order to use OLS, the parameter must be in form of:
                              $$ x=factorsReturns^T = \begin{bmatrix} r_{00} & r_{10} & ... & r_{n0} \\ r_{01} & r_{11} & ... & r_{n1} \\ r_{02} & r_{12} & ... & r_{n2}\\ ... & ... & ... & ... \\ r_{0k} & r_{1k} & ... & r_{nk}\\ \end{bmatrix} $$
                                Whereas, $y$ can be any row in `stocksReturns`.
                                  So, we need a function to transpose a matrix. Write a function named `transpose` to do just that.
                                  In [20]:
                                  def transpose(matrix):
                                      matrix_trans=[] 
                                      matrix_t=[[matrix[i][j] for i in range(len(matrix))]for j in range(len(matrix[0]))]
                                      return matrix_t
                                      
                                  # test function
                                  assert (transpose([[1,2,3], [4,5,6], [7,8,9]]) == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]), "Function transpose runs incorrectly"
                                  

                                  Question 5.2

                                  Write a function named `featurize` that takes a list factor's returns $[x_1, x_2,...,x_k]$ and transform it into a new list of features $[u_1,u_2,..,u_k, v_1, v_2,..,v_k, x_1,x_2,...,x_k]$.
                                    Where, $u_i$ = $\left\{ \begin{array}{ll} x_i^2 & \mbox{if } x_i \geq 0 \\ -x_i^2 & \mbox{if } x_i < 0 \end{array} \right. $
                                      and $v_i$ = $\left\{ \begin{array}{ll} \sqrt{x_i} & \mbox{if } x_i \geq 0 \\ -\sqrt{x_i} & \mbox{if } x_i < 0 \end{array} \right. $
                                      In [21]:
                                      def featurize(factorReturns):
                                          squaredReturns = [w**2 if w>=0 else -w**2 for w in factorReturns]
                                          squareRootedReturns = [np.sqrt(w) if w>=0 else -np.sqrt(-w) for w in factorReturns]
                                          # concat new features
                                          return squaredReturns + squareRootedReturns + factorReturns
                                      
                                      # test our function
                                      assert (featurize([4, -9, 25]) == [16, -81, 625, 2, -3, 5, 4, -9, 25]), "Function runs incorrectly"
                                      

                                      Question 5.3

                                      Using OLS, estimate the weights for each feature on each stock. What is the shape of `weights` (size of each dimension)? Explain it.
                                      In [22]:
                                      def estimateParams(y, x):
                                          return sm.OLS(y, x).fit().params
                                      
                                      # transpose factorsReturns
                                      factorMat = transpose(factorsReturns)
                                      
                                      # featurize each row of factorMat
                                      factorFeatures = list(map(featurize,factorMat))
                                      
                                      # OLS require parameter is a numpy array
                                      factor_columns = np.array(factorFeatures)
                                      
                                      #add a constant - the intercept term for each instrument i.
                                      factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      
                                      # estimate weights
                                      weights = [estimateParams(stockReturns,factor_columns) for stockReturns in stocksReturns]
                                      
                                      #print("weights:", weights)
                                      print("size of weights = ",np.array(weights).shape)
                                      print("size of stockReturns = ",np.array(stocksReturns).shape)
                                      print("size of factor_columns = ",np.array(factor_columns).shape)
                                      
                                      size of weights =  (29, 13)
                                      size of stockReturns =  (29, 1295)
                                      size of factor_columns =  (1295, 13)
                                      
                                      The weights is the multiplication of the 2 matrices stocksReturns and factor_columns.
                                      It has 29 rows corresponding to 29 returns of different stocks , and 13 columns, 12 of them represent the features (4 features x 3 transformations (square,square root ,identity) =12) and one feature which is an added constant.

                                      Step 2: Defining the distributions for the market conditions

                                      Since we cannot define the distributions for the market factors directly, we can only approximate their distribution. The best way to do that, is plotting their value. However, these values may fluctuate quite a lot.

                                      Next, we show how to use the Kernel density estimation (KDE) technique to approximate such distributions. In brief, kernel density estimation is a way of smoothing out a histogram: this is achieved by assigning (or centering) a probability distribution (usually a normal distribution) to each data point, and then summing. So, a set of two-week-return samples would result in a large number of "super-imposed" normal distributions, each with a different mean.

                                      To estimate the probability density at a given point, KDE evaluates the PDFs of all the normal distributions at that point and takes their average. The smoothness of a kernel density plot depends on its bandwidth, and the standard deviation of each of the normal distributions. For a brief introduction on KDE, please refer to this link.

                                      In [104]:
                                      factor_names=["Crude Oil","Treasury Bond","GSPC","IXIC"]
                                      
                                      In [105]:
                                      def plotDistribution(samples,title):
                                          vmin = min(samples)
                                          vmax = max(samples)
                                          stddev = np.std(samples)
                                          
                                          domain = np.arange(vmin, vmax, (vmax-vmin)/100)
                                          
                                          # a simple heuristic to select bandwidth
                                          bandwidth = 1.06 * stddev * pow(len(samples), -.2)
                                          
                                          # estimate density
                                          kde = KDEUnivariate(samples)
                                          kde.fit(bw=bandwidth)
                                          density = kde.evaluate(domain)
                                          
                                          # plot KDE
                                          plt.plot(domain, density)
                                          plt.title("KDE of {}".format(title))
                                          
                                          # Plot the fitted Normal Distribution
                                          (mu, std) = norm.fit(samples)
                                          xmin, xmax = plt.xlim()
                                          x = np.linspace(xmin, xmax, 100)
                                          p = norm.pdf(x, mu, std)
                                          plt.plot(x, p, 'k', linewidth=2,color='r',linestyle='--',label="Fitted Normal Distribution")
                                          plt.legend()
                                          plt.show()
                                      
                                      plotDistribution(factorsReturns[0],factor_names[0])
                                      plotDistribution(factorsReturns[1],factor_names[1])
                                      plotDistribution(factorsReturns[2],factor_names[2])
                                      plotDistribution(factorsReturns[3],factor_names[3])
                                      
                                      Let's draw a Q-Q plot to see if the market factors fit well a normal distribution.
                                      In [151]:
                                      plt.figure(figsize=(70,70))
                                      for i in range(len(factorsReturns)):
                                          plt.subplot(2,2,i+1)
                                          plt.xticks(fontsize=30)
                                          plt.yticks(fontsize=30)
                                          stats.probplot(factorsReturns[i], dist="norm", plot=pylab)
                                          plt.title("Q-Q Plot of {}".format(factor_names[i]),fontsize=40)
                                      
                                      We can see that a normal distribution is an acceptable model,but will it be enough to obtain a good model at the end? We see that later.

                                      For the sake of simplicity, we can say that our smoothed versions of the returns of each factor can be represented quite well by a normal distribution. Of course, more exotic distributions, perhaps with fatter tails, could fit more closely the data, but it is outside the scope of this Notebook to proceed in this way.

                                      Now, the simplest way to sample factors returns is to use a normal distribution for each of the factors, and sample from these distributions independently. However, this approach ignores the fact that market factors are often correlated. For example, when the price of crude oil is down, the price of treasury bonds is down too. We can check our data to verify about the correlation.

                                      Question 6

                                      Question 6.1

                                      Calculate the correlation between market factors and explain the result.

                                      HINT
                                      function np.corrcoef might be useful.

                                      In [106]:
                                      correlation = np.corrcoef(factorsReturns)
                                      
                                      Let's plot a heatmap!
                                      In [62]:
                                      cmap = sns.diverging_palette(220, 10,as_cmap=True)
                                      mask = np.zeros_like(correlation, dtype=np.bool)
                                      mask[np.triu_indices_from(mask)] = True
                                      plt.figure(figsize=(10,10))
                                      # Draw the heatmap with the mask and correct aspect ratio
                                      sns.heatmap(correlation,vmin=0.2,vmax=0.95,cmap=cmap, center=0,annot=True,
                                                  square=True, linewidths=.5, cbar_kws={"shrink": .5},xticklabels=factor_names,yticklabels=factor_names)
                                      plt.xticks(rotation=0)
                                      plt.yticks(rotation=0)
                                      plt.title("Correlations between Market Factors")
                                      plt.show()
                                      
                                      We have positive correlations.
                                      The correlation matrix has non-zero elements on the diagonal. So we can't consider each market factor as a univariate distribution and multiply the results together. We need to use multivariate distributions.

                                      The multivariate normal distribution can help here by taking the correlation information between the factors into account. Each sample from a multivariate normal distribution can be thought of as a vector. Given values for all of the dimensions but one, the distribution of values along that dimension is normal. But, in their joint distribution, the variables are not independent.

                                      For this use case, we can write:

                                      $$ \left(\begin{array}{c}f_{1}\\f_{2}\\f_{3}\\f_{4} \end{array}\right) \sim N \left[ \left( \begin{array}{c} \mu_1\\ \mu_2 \\ \mu_3 \\ \mu_4 \end{array} \right), \left( \begin{array}{cccc} \sigma^2_1 & \rho_{12} \sigma_1\sigma_2 & \rho_{13} \sigma_1\sigma_3 & \rho_{14} \sigma_1\sigma_4 \\ \rho_{12}\sigma_2\sigma_1 & \sigma^2_2 & \rho_{23} \sigma_2\sigma_3 & \rho_{24} \sigma_2\sigma_4\\ \rho_{13} \sigma_3\sigma_1 & \rho_{23} \sigma_3\sigma_2 & \sigma^2_3 & \rho_{34} \sigma_3\sigma_4 \\ \rho_{14} \sigma_4\sigma_1 & \rho_{24} \sigma_4\sigma_2 & \rho_{34} \sigma_3\sigma_4 & \sigma_4^2 \\ \end{array} \right) \right] $$

                                      Or,

                                      $$ f_t \sim N(\mu, \sum) $$

                                      Where $f_1$, $f_2$, $f_3$ and $f_4$ are the market factors, $\sigma_i$ is the standard deviation of factor $i$, $\mu$ is a vector of the empirical means of the returns of the factors and $\sum$ is the empirical covariance matrix of the returns of the factors.

                                      The multivariate normal is parameterized with a mean along each dimension and a matrix describing the covariance between each pair of dimensions. When the covariance matrix is diagonal, the multivariate normal reduces to sampling along each dimension independently, but placing non-zero values in the off-diagonals helps capture the relationships between variables. Whenever having the mean of this multivariate normal distribution and its covariance matrix, we can generate the sample values for market factors.

                                      Next, we will calculate the mean and the covariance matrix of this multivariate normal distribution from the historical data.

                                      Question 6.2

                                      Calculate the covariance matrix $\sum$ and the means $\mu$ of factors' returns then generate a random vector of factors return that follows a multivariate normal distribution $\sim N(\mu, \sum)$

                                      HINT
                                      Function np.cov can help calculating covariance matrix. Function np.random.multivariate_normal(<mean>, <cov>) is often used for generating samples.

                                      In [23]:
                                      factorCov = np.cov(factorsReturns)
                                      factorMeans = [sum(w)/len(w) for w in factorsReturns]
                                      sample = np.random.multivariate_normal(factorMeans,factorCov)
                                      print("Covariance Matrix = ",factorCov)
                                      print("\n")
                                      print("Mean= ",factorMeans)
                                      print("\n")
                                      print("Sample = ",sample)
                                      print("\n")
                                      
                                      Covariance Matrix =  [[2.03712313e+01 2.63376804e-01 7.77281497e+01 1.72733934e+02]
                                       [2.63376804e-01 2.22650671e-02 3.18895549e+00 7.29816937e+00]
                                       [7.77281497e+01 3.18895549e+00 1.31938031e+03 2.88775882e+03]
                                       [1.72733934e+02 7.29816937e+00 2.88775882e+03 6.96989443e+03]]
                                      
                                      
                                      Mean=  [0.35908880308880364, -0.0027590733590733448, 6.953868592277998, 18.7092272903475]
                                      
                                      
                                      Sample =  [  2.93090118  -0.23458834  89.09143689 170.38070307]
                                      
                                      
                                      

                                      Step 3&4: Generating samples, running simulation and calculating the VaR

                                      We define some functions that helps us calculating VaR 5%. You will see that the functions below are pretty complicated! This is why we provide a solution for you: however, study them well!!

                                      The basic idea of calculating VaR 5% is that we need to find a value such that only 5% of the losses are bigger than it. That means the 5th percentile of the losses should be VaR 5%.

                                      VaR can sometimes be problematic though, since it does give any information on the extent of the losses which can exceed the VaR estimate. CVar is an extension of VaR that is introduced to deal with this problem. Indeed, CVaR measures the expected value of the loss in those cases where VaR estimate has been exceeded.

                                      In [24]:
                                      def fivePercentVaR(trials):
                                          numTrials = trials.count()
                                          topLosses = trials.takeOrdered(max(round(numTrials/20.0), 1))
                                          return topLosses[-1]
                                      
                                      # an extension of VaR
                                      def fivePercentCVaR(trials):
                                          numTrials = trials.count()
                                          topLosses = trials.takeOrdered(max(round(numTrials/20.0), 1))
                                          return sum(topLosses)/len(topLosses)
                                      
                                      def bootstrappedConfidenceInterval(
                                            trials, computeStatisticFunction,
                                            numResamples, pValue):
                                          stats = []
                                          for i in range(0, numResamples):
                                              resample = trials.sample(True, 1.0)
                                              stats.append(computeStatisticFunction(resample))
                                          stats=sorted(stats)#######this was fixed
                                          lowerIndex = int(numResamples * pValue / 2 - 1)
                                          upperIndex = int(np.ceil(numResamples * (1 - pValue / 2)))
                                          return (stats[lowerIndex], stats[upperIndex])
                                      

                                      Next, we will run the Monte Carlo simulation 10,000 times, in parallel using Spark. Since your cluster has 12 cores (two Spark worker nodes, each with 6 cores), we can set parallelism = 12 to dispatch simulation on these cores, across the two machines (remember, those are not really "physical machines", they are Docker containers running in our infrastructure).

                                      Question 7

                                      Complete the code below to define the simulation process and calculate VaR 5%.
                                      In [25]:
                                      ### RUN SILMULATION
                                      def simulateTrialReturns(numTrials, factorMeans, factorCov, weights):
                                          trialReturns = []
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = list(np.random.multivariate_normal(factorMeans, factorCov))
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize(trialFactorReturns)
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              trialFeatures=np.array(trialFeatures)
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = sum(np.dot(weights,trialFeatures))
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      trials = seedRDD.flatMap(lambda idx: \
                                                      simulateTrialReturns(
                                                          max(int(numTrials/parallelism), 1), 
                                                          factorMeans, factorCov,
                                                          bFactorWeights.value
                                                      ))
                                      trials.cache()
                                      
                                      valueAtRisk = fivePercentVaR(trials)
                                      conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                      print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      
                                      Value at Risk(VaR) 5%: -28.399840388247267
                                      Conditional Value at Risk(CVaR) 5%: -36.909449762335974
                                      
                                      There is a 5% chance that we will loose more than 28.39 dollars over a two week's period and the expected shortfall or CVaR is 36.9. This gives us expected information about how much the loss would be if et exceeded the VaR.

                                      The value of VaR depends on how many invested stocks and the chosen distribution of random variables. Assume that we get VaR 5% = -2.66, that means that there is a 0.05 probability that the portfolio will fall in value by more than \$2.66 over a two weeks' period if there is no trading. In other words, the loses are less than \$2.66 over two weeks' period with 95% confidence level. When a loss over two weeks is more than \$2.66, we call it **failure** (or **exception**). Informally, because of 5% probability, we expect that there are only $0.05*W$ failures out of total $W$ windows.

                                      Step 5: Evaluating the results using backtesting method

                                      In general, the error in a Monte Carlo simulation should be proportional to 1/sqrt(n), where n is the number of trials. This means, for example, that quadrupling the number of trials should approximately cut the error in half. A good way to check the quality of a result is backtesting on historical data. Backtesting is a statistical procedure where actual losses are compared to the estimated VaR. For instance, if the confidence level used to calculate VaR is 95% (or VaR 5%), we expect only 5 failures over 100 two-week time windows.

                                      The most common test of a VaR model is counting the number of VaR failures, i.e., in how many windows, the losses exceed VaR estimate. If the number of exceptions is less than selected confidence level would indicate, the VaR model overestimates the risk. On the contrary, if there are too many exceptions, the risk is underestimated. However, it's very hard to observe the amount of failures suggested by the confidence level exactly. Therefore, people try to study whether the number of failures is reasonable or not, or will the model be accepted or rejected.

                                      One common test is Kupiec's proportion-of-failures (POF) test. This test considers how the portfolio performed at many historical time intervals and counts the number of times that the losses exceeded the VaR. The null hypothesis is that the VaR is reasonable, and a sufficiently extreme test statistic means that the VaR estimate does not accurately describe the data. The test statistic is computed as:

                                      $$ -2ln\Bigg(\frac{(1-p)^{T-x}p^x}{(1-\frac{x}{T})^{T-x}(\frac{x}{T})^x}\Bigg) $$

                                      where:

                                      $p$ is the quantile-of-loss of the VaR calculation (e.g., in VaR 5%, p=0.05),

                                      $x$ (the number of failures) is the number of historical intervals over which the losses exceeded the VaR

                                      $T$ is the total number of historical intervals considered

                                      Or we can expand out the log for better numerical stability:

                                      $$ \begin{equation} -2\Big((T-x)ln(1-p)+x*ln(p)-(T-x)ln(1-\frac{x}{T})+x*ln(\frac{x}{T})\Big) \end{equation} $$

                                      If we assume the null hypothesis that the VaR is reasonable, then this test statistic is drawn from a chi-squared distribution with a single degree of freedom. By using Chi-squared distribution, we can find the p-value accompanying our test statistic value. If p-value exceeds the critical value of the Chi-squared distribution, we do have sufficient evidence to reject the null hypothesis that the model is reasonable. Or we can say, in that case, the model is considered as inaccurate.

                                      For example, assume that we calculate VaR 5% (the confidence level of the VaR model is 95%) and get value VaR = 2.26. We also observed 50 exceptions over 500 time windows. Using the formula above, the test statistic p-value is calculated and equal to 8.08. Compared to 3.84, the critical value of Chi-squared distribution with one degree of freedom at probability 5%, the test statistic is larger. So, the model is rejected. The critical values of Chi-squared can be found by following this link. However, in this Notebook, it's not a good idea to find the corresponding critical value by looking in a "messy" table, especially when we need to change the confidence level. Instead, from p-value, we will calculate the probability of the test statistic in Chi-square thanks to some functions in package scipy. If the calculated probability is smaller than the quantile of loss (e.g, 0.05), the model is rejected and vice versa.

                                      Question 8

                                      Question 8.1

                                      Write a function to calculate the number of failures, that is when the losses (in the original data) exceed the VaR.

                                      HINT

                                      • First, we need to calculate the total loss in each 2-week time interval
                                      • If the total loss of a time interval exceeds VaR, then we say that our VaR fails to estimate the risk in that time interval
                                      • Return the number of failures

                                      NOTE
                                      The loss is often having negative value, so, be careful when compare it to VaR.

                                      In [66]:
                                      def countFailures(stocksReturns, valueAtRisk):
                                          failures = 0
                                          # iterate over time intervals
                                          for i in range(0, len(stocksReturns[0])):
                                              # calculate the losses in each time interval
                                              loss = sum(w[i] for w in stocksReturns)
                                              
                                              # if the loss exceeds VaR
                                              if loss<valueAtRisk:
                                                  failures += 1
                                          return failures
                                      

                                      Question 8.2

                                      Write a function named `kupiecTestStatistic` to calculate the test statistic which was described in the above equation.
                                      In [67]:
                                      def kupiecTestStatistic(total, failures, confidenceLevel):
                                          failureRatio = failures/total
                                          logNumer = (total-failures)*np.log(1-confidenceLevel)+failures*np.log(confidenceLevel)
                                          logDenom = (total-failures)*np.log(1-failureRatio)+failures*np.log(failureRatio)
                                          return -2 * (logNumer - logDenom)
                                          
                                      # test the function
                                      assert (round(kupiecTestStatistic(250, 36, 0.1), 2) == 4.80), "function kupiecTestStatistic runs incorrectly"
                                      

                                      Now we can find the p-value accompanying our test statistic value.

                                      In [68]:
                                      def kupiecTestPValue(stocksReturns, valueAtRisk, confidenceLevel):
                                          failures = countFailures(stocksReturns, valueAtRisk)
                                          N = len(stocksReturns)
                                          print("num failures:", failures)
                                          total = len(stocksReturns[0])
                                          testStatistic = kupiecTestStatistic(total, failures, confidenceLevel)
                                          #return 1 - stats.chi2.cdf(testStatistic, 1.0)
                                          return stats.chi2.sf(testStatistic, 1.0)
                                      
                                      In [39]:
                                      varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                      cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                      print("VaR confidence interval: " , varConfidenceInterval)
                                      print("CVaR confidence interval: " , cvarConfidenceInterval)
                                      print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
                                      
                                      VaR confidence interval:  (-27.815945855897404, -26.17102408375137)
                                      CVaR confidence interval:  (-36.06128460891518, -34.00304701834569)
                                      num failures: 135
                                      Kupiec test p-value:  3.487062859076128e-15
                                      

                                      Question 8.3

                                      Discuss the results you have obtained
                                      The VaR Condifdence Interval is between -27.81 and -26.17 . This means that over a period of two weeks, there is a 5% chance that we will loose more than or equal to the values between 26.17 and 27.81 . Our expected loss or shortfall is between 34.00 and 36.06 dollars .
                                      The Kupiec test p-value is less than 0.05, so the model used is not good and is rejected.

                                      Question 9

                                      Assume that we invest in more than 100 stocks. Use the same market factors as for the previous questions to estimate VaR by running MCS, then validate your result. What is the main observation you have, once you answer this question? When you plan to invest in more instruments, how is your ability to predict the risk going to be affected?
                                      In [40]:
                                      # select path of all stock data files in "stock_folder"
                                      files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
                                      
                                      # assume that we invest only the first 35 stocks (for faster computation)
                                      files = files[:150]
                                      
                                      # read each line in each file, convert it into the format: (date, value)
                                      rawStocks = [process_stock_file(f) for f in files]
                                      
                                      # select only instruments which have more than 5 years of history
                                      # Note: the number of business days in a year is 260
                                      number_of_years = 5
                                      stock_names_small=[stock_names[i] for i in range(len(files)) if(len(rawStocks[i])>260*5) ]
                                      rawStocks = list(filter(lambda instrument:len(instrument)>260*5 , rawStocks))
                                      
                                      # For testing, print the first 5 entry of the first stock
                                      print(rawStocks[0][:5])
                                      
                                      [(datetime.datetime(1997, 8, 14, 0, 0), 6.011436), (datetime.datetime(1997, 8, 15, 0, 0), 6.473854), (datetime.datetime(1997, 8, 18, 0, 0), 7.47576), (datetime.datetime(1997, 8, 19, 0, 0), 7.39869), (datetime.datetime(1997, 8, 20, 0, 0), 7.39869)]
                                      
                                      In [41]:
                                      len(rawStocks)
                                      
                                      Out[41]:
                                      115
                                      The number of stocks decreased form 150 to 115.
                                      Let's do some time alignment.
                                      In [42]:
                                      # trim into a specific time region
                                      # and fill up the missing values
                                      stocks = list(map(lambda stock: \
                                                  fillInHistory(
                                                      trimToRegion(stock, start, end), 
                                                  start, end), 
                                              rawStocks))
                                      
                                      
                                      
                                      # merge two factors, trim each factor into a time region
                                      # and fill up the missing values
                                      allfactors = factors1 + factors2
                                      factors = list(map( lambda factor: \
                                                  fillInHistory(
                                                      trimToRegion(factor, start,end),
                                                  start,end), 
                                                  allfactors
                                                  ))
                                                  
                                      # test our code
                                      print("the first 5 records of stock 0:", stocks[0][:5], "\n")
                                      print("the last 5 records of stock 0:", stocks[0][-5:], "\n")
                                      print("the first 5 records of factor 0:", factors[0][:5], "\n")
                                      print("the first 5 records of factor 0:", factors[0][-5:], "\n")
                                      
                                      the first 5 records of stock 0: [(datetime.datetime(2009, 1, 23, 0, 0), 16.254108), (datetime.datetime(2009, 1, 26, 0, 0), 16.470275), (datetime.datetime(2009, 1, 27, 0, 0), 16.703071), (datetime.datetime(2009, 1, 28, 0, 0), 17.975132), (datetime.datetime(2009, 1, 29, 0, 0), 16.478589)] 
                                      
                                      the last 5 records of stock 0: [(datetime.datetime(2014, 1, 16, 0, 0), 35.571935), (datetime.datetime(2014, 1, 17, 0, 0), 35.552912), (datetime.datetime(2014, 1, 20, 0, 0), 35.552912), (datetime.datetime(2014, 1, 21, 0, 0), 35.971404), (datetime.datetime(2014, 1, 22, 0, 0), 36.019202)] 
                                      
                                      the first 5 records of factor 0: [(datetime.datetime(2009, 1, 23, 0, 0), 46.47), (datetime.datetime(2009, 1, 26, 0, 0), 45.73), (datetime.datetime(2009, 1, 27, 0, 0), 41.58), (datetime.datetime(2009, 1, 28, 0, 0), 42.16), (datetime.datetime(2009, 1, 29, 0, 0), 41.44)] 
                                      
                                      the first 5 records of factor 0: [(datetime.datetime(2014, 1, 16, 0, 0), 93.96), (datetime.datetime(2014, 1, 17, 0, 0), 94.37), (datetime.datetime(2014, 1, 20, 0, 0), 93.93), (datetime.datetime(2014, 1, 21, 0, 0), 94.99), (datetime.datetime(2014, 1, 22, 0, 0), 96.73)] 
                                      
                                      
                                      Let's estimate the weights.
                                      In [43]:
                                      stocksReturns = list(map(twoWeekReturns, stocks))
                                      # transpose factorsReturns
                                      factorMat = transpose(factorsReturns)
                                      
                                      # featurize each row of factorMat
                                      factorFeatures = list(map(featurize,factorMat))
                                      
                                      # OLS require parameter is a numpy array
                                      factor_columns = np.array(factorFeatures)
                                      
                                      #add a constant - the intercept term for each instrument i.
                                      factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      #print(factor_columns[0])
                                      # estimate weights
                                      weights = [estimateParams(stockReturns,factor_columns) for stockReturns in stocksReturns]
                                      
                                      #print("weights:", weights)
                                      print("size of weights= ",np.array(weights).shape)
                                      print("size of stockReturns= ",np.array(stocksReturns).shape)
                                      print("size of factor_columns= ",np.array(factor_columns).shape)
                                      
                                      size of weights=  (115, 13)
                                      size of stockReturns=  (115, 1295)
                                      size of factor_columns=  (1295, 13)
                                      
                                      Let's do the calculations.
                                      In [44]:
                                      #Covariance and mean 
                                      factorCov = np.cov(factorsReturns)
                                      factorMeans = [sum(w)/len(w) for w in factorsReturns]
                                      
                                      In [178]:
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      trials = seedRDD.flatMap(lambda idx: \
                                                      simulateTrialReturns(
                                                          max(int(numTrials/parallelism), 1), 
                                                          factorMeans, factorCov,
                                                          bFactorWeights.value
                                                      ))
                                      trials.cache()
                                      
                                      valueAtRisk = fivePercentVaR(trials)
                                      conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                      print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      
                                      Value at Risk(VaR) 5%: -410.2557300086793
                                      Conditional Value at Risk(CVaR) 5%: -671.3433276780536
                                      
                                      There is a 5% chance that we loose more than 410 dollars over a period of two weeks.
                                      In [134]:
                                      varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                      cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                      print("VaR confidence interval: " , varConfidenceInterval)
                                      print("CVaR confidence interval: " , cvarConfidenceInterval)
                                      print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
                                      
                                      VaR confidence interval:  (-436.73030020881976, -388.5726600229409)
                                      CVaR confidence interval:  (-771.6950361060042, -675.7328807003033)
                                      num failures: 211
                                      Kupiec test p-value:  1.1319704652398758e-50
                                      
                                      The VaR Condifdence Interval is between -436.8 and -388.6 . This means that over a period of two weeks, there is a 5% chance that we will loose more than or equal to the values between 388.6 and 436.8 dollars . Our expected loss or shortfall is between 675.7 and 771.7 .
                                      The Kupiec test p-value is less than 0.05, so the model used is not good and is rejected.
                                      We can see that the values for the VaR and CVaR intervals increased. This is trivial as now we have more stocks that have higher risks of loosing more money.
                                      We see that the model performed worse on the Kupiec test,but however this is not always the case as we will see later.

                                      Question 10

                                      In the previous questions, we used the normal distributions to sample the factors returns. Try to study how results vary when selecting other probability distributions: our goal is to improve the result of our MCS.
                                      Let's try using a t-Student distribution.

                                      Using 150 Stocks

                                      t-Student Distribution

                                      In [46]:
                                      #function that samples form a multivaruate t-Student distribution ,featurizes the sample,and calculate the returns
                                      
                                      def Student_simulateTrialReturns(numTrials, factorMeans, factorCov, weights,degree):
                                          trialReturns = []
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = mul.multivariate_t_rvs(factorMeans,factorCov,df=degree,n=1).tolist()[0]
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize(trialFactorReturns)
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              trialFeatures=np.array(trialFeatures)
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = sum(np.dot(weights,trialFeatures).tolist())
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      In [185]:
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in (np.arange(30)/2+1.5):
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 1.500000
                                      Value at Risk(VaR) 5%: -3687.906649963236
                                      Conditional Value at Risk(CVaR) 5%: -677771.2338996766
                                      VaR confidence interval:  (-4703.1370682942, -3329.72169287651)
                                      CVaR confidence interval:  (-1553073.7068730192, -94138.01996142662)
                                      num failures: 36
                                      Kupiec test p-value:  6.672125661249312e-05
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.000000
                                      Value at Risk(VaR) 5%: -2172.3826503812666
                                      Conditional Value at Risk(CVaR) 5%: -45992.27057661668
                                      VaR confidence interval:  (-2417.321225190685, -1994.7903388884079)
                                      CVaR confidence interval:  (-98664.2746941403, -16974.59108053889)
                                      num failures: 42
                                      Kupiec test p-value:  0.001991292838160743
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.500000
                                      Value at Risk(VaR) 5%: -1526.5760730334287
                                      Conditional Value at Risk(CVaR) 5%: -12083.619527996681
                                      VaR confidence interval:  (-1703.142336302026, -1397.7601551705518)
                                      CVaR confidence interval:  (-22889.512106037273, -6412.68920242322)
                                      num failures: 54
                                      Kupiec test p-value:  0.15872837799409656
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.000000
                                      Value at Risk(VaR) 5%: -1092.9132158531768
                                      Conditional Value at Risk(CVaR) 5%: -5228.953663949226
                                      VaR confidence interval:  (-1198.0432182301968, -963.1044001364093)
                                      CVaR confidence interval:  (-6433.368662940612, -3907.9209055148713)
                                      num failures: 70
                                      Kupiec test p-value:  0.5085489050446723
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.500000
                                      Value at Risk(VaR) 5%: -1007.3457993635827
                                      Conditional Value at Risk(CVaR) 5%: -4356.602472544087
                                      VaR confidence interval:  (-1125.3291976445898, -888.6673825498162)
                                      CVaR confidence interval:  (-5544.659506909265, -3146.8514156451915)
                                      num failures: 73
                                      Kupiec test p-value:  0.30216693471488454
                                      **********************************************************************************
                                      t-Student Distribution with degree = 4.000000
                                      Value at Risk(VaR) 5%: -872.0813759251827
                                      Conditional Value at Risk(CVaR) 5%: -2967.9173505955296
                                      VaR confidence interval:  (-938.0146023934583, -818.105368563276)
                                      CVaR confidence interval:  (-3728.9992903074176, -2331.4392378158577)
                                      num failures: 89
                                      Kupiec test p-value:  0.0033552687802331674
                                      **********************************************************************************
                                      t-Student Distribution with degree = 4.500000
                                      Value at Risk(VaR) 5%: -789.9678292243173
                                      Conditional Value at Risk(CVaR) 5%: -2446.156741842787
                                      VaR confidence interval:  (-852.1605674756327, -736.3586995117623)
                                      CVaR confidence interval:  (-3033.184025810503, -2036.7309124046085)
                                      num failures: 101
                                      Kupiec test p-value:  1.8044361602279846e-05
                                      **********************************************************************************
                                      t-Student Distribution with degree = 5.000000
                                      Value at Risk(VaR) 5%: -749.4156162732298
                                      Conditional Value at Risk(CVaR) 5%: -2098.1531521255897
                                      VaR confidence interval:  (-796.3212158622213, -684.2811307779291)
                                      CVaR confidence interval:  (-2498.5755231136873, -1759.431948121885)
                                      num failures: 108
                                      Kupiec test p-value:  4.32481037907427e-07
                                      **********************************************************************************
                                      t-Student Distribution with degree = 5.500000
                                      Value at Risk(VaR) 5%: -677.7477229363326
                                      Conditional Value at Risk(CVaR) 5%: -1644.3229042548617
                                      VaR confidence interval:  (-726.8319053861258, -623.7825279695912)
                                      CVaR confidence interval:  (-1795.4609077081616, -1459.1919106304254)
                                      num failures: 123
                                      Kupiec test p-value:  3.043573162691586e-11
                                      **********************************************************************************
                                      t-Student Distribution with degree = 6.000000
                                      Value at Risk(VaR) 5%: -629.7172236966372
                                      Conditional Value at Risk(CVaR) 5%: -1649.2784947887487
                                      VaR confidence interval:  (-672.0934561800896, -570.2405974868193)
                                      CVaR confidence interval:  (-2045.5178543209977, -1358.4837665603573)
                                      num failures: 129
                                      Kupiec test p-value:  3.789467951540318e-13
                                      **********************************************************************************
                                      t-Student Distribution with degree = 6.500000
                                      Value at Risk(VaR) 5%: -661.8230205119348
                                      Conditional Value at Risk(CVaR) 5%: -1476.7569562177482
                                      VaR confidence interval:  (-712.8379075958337, -616.9835633286336)
                                      CVaR confidence interval:  (-1614.178472048995, -1332.8362852624298)
                                      num failures: 127
                                      Kupiec test p-value:  1.6918092918867929e-12
                                      **********************************************************************************
                                      t-Student Distribution with degree = 7.000000
                                      Value at Risk(VaR) 5%: -636.8496879503667
                                      Conditional Value at Risk(CVaR) 5%: -1432.3947889016579
                                      VaR confidence interval:  (-698.529287704512, -586.5748197249342)
                                      CVaR confidence interval:  (-1566.4874386586234, -1326.634872732528)
                                      num failures: 128
                                      Kupiec test p-value:  8.040809945148969e-13
                                      **********************************************************************************
                                      t-Student Distribution with degree = 7.500000
                                      Value at Risk(VaR) 5%: -583.9015773154936
                                      Conditional Value at Risk(CVaR) 5%: -1263.804562586157
                                      VaR confidence interval:  (-614.4225282436465, -545.2997557458897)
                                      CVaR confidence interval:  (-1403.7260063243834, -1155.610754461442)
                                      num failures: 143
                                      Kupiec test p-value:  4.281714400990231e-18
                                      **********************************************************************************
                                      t-Student Distribution with degree = 8.000000
                                      Value at Risk(VaR) 5%: -547.1577286248161
                                      Conditional Value at Risk(CVaR) 5%: -1163.1010091128107
                                      VaR confidence interval:  (-588.1487667625896, -514.8176705463488)
                                      CVaR confidence interval:  (-1277.1862924676143, -1076.584846093946)
                                      num failures: 153
                                      Kupiec test p-value:  4.926248599357563e-22
                                      **********************************************************************************
                                      t-Student Distribution with degree = 8.500000
                                      Value at Risk(VaR) 5%: -562.8764458659537
                                      Conditional Value at Risk(CVaR) 5%: -1149.2761522829921
                                      VaR confidence interval:  (-594.0601787756336, -521.7773890069362)
                                      CVaR confidence interval:  (-1264.3922433063985, -1039.8741622342177)
                                      num failures: 147
                                      Kupiec test p-value:  1.2449111911587396e-19
                                      **********************************************************************************
                                      t-Student Distribution with degree = 9.000000
                                      Value at Risk(VaR) 5%: -525.8456731919987
                                      Conditional Value at Risk(CVaR) 5%: -1072.966449655092
                                      VaR confidence interval:  (-587.351106665461, -500.2301533221121)
                                      CVaR confidence interval:  (-1153.921142704222, -1015.2235019846199)
                                      num failures: 161
                                      Kupiec test p-value:  2.0519237115526805e-25
                                      **********************************************************************************
                                      t-Student Distribution with degree = 9.500000
                                      Value at Risk(VaR) 5%: -553.8670205165394
                                      Conditional Value at Risk(CVaR) 5%: -1132.310936669693
                                      VaR confidence interval:  (-592.5182403778567, -515.6872123059268)
                                      CVaR confidence interval:  (-1212.103509703711, -1017.2915540362366)
                                      num failures: 151
                                      Kupiec test p-value:  3.2081251037641903e-21
                                      **********************************************************************************
                                      t-Student Distribution with degree = 10.000000
                                      Value at Risk(VaR) 5%: -515.118019525956
                                      Conditional Value at Risk(CVaR) 5%: -1004.4134663508993
                                      VaR confidence interval:  (-556.8056655140896, -488.6810899445241)
                                      CVaR confidence interval:  (-1093.9821753824597, -918.1599798495351)
                                      num failures: 163
                                      Kupiec test p-value:  2.731310077716742e-26
                                      **********************************************************************************
                                      t-Student Distribution with degree = 10.500000
                                      Value at Risk(VaR) 5%: -543.2574619693125
                                      Conditional Value at Risk(CVaR) 5%: -1043.9602405424544
                                      VaR confidence interval:  (-573.6215882276658, -497.24799834822954)
                                      CVaR confidence interval:  (-1119.763390600333, -951.7254151712417)
                                      num failures: 153
                                      Kupiec test p-value:  4.926248599357563e-22
                                      **********************************************************************************
                                      t-Student Distribution with degree = 11.000000
                                      Value at Risk(VaR) 5%: -524.9759637698993
                                      Conditional Value at Risk(CVaR) 5%: -1003.3653529351827
                                      VaR confidence interval:  (-546.3904187425627, -492.500930442811)
                                      CVaR confidence interval:  (-1068.466637248137, -940.0679066579607)
                                      num failures: 161
                                      Kupiec test p-value:  2.0519237115526805e-25
                                      **********************************************************************************
                                      t-Student Distribution with degree = 11.500000
                                      Value at Risk(VaR) 5%: -529.0876244229165
                                      Conditional Value at Risk(CVaR) 5%: -1011.5061234424512
                                      VaR confidence interval:  (-556.5479366846973, -497.48100554457255)
                                      CVaR confidence interval:  (-1084.0293534241514, -918.1455167877855)
                                      num failures: 161
                                      Kupiec test p-value:  2.0519237115526805e-25
                                      **********************************************************************************
                                      t-Student Distribution with degree = 12.000000
                                      Value at Risk(VaR) 5%: -510.50837401358785
                                      Conditional Value at Risk(CVaR) 5%: -1031.3007220101128
                                      VaR confidence interval:  (-541.2326317940997, -484.13056458118257)
                                      CVaR confidence interval:  (-1121.0862245259962, -965.9640275979813)
                                      num failures: 167
                                      Kupiec test p-value:  4.45471564332909e-28
                                      **********************************************************************************
                                      t-Student Distribution with degree = 12.500000
                                      Value at Risk(VaR) 5%: -502.70588711836604
                                      Conditional Value at Risk(CVaR) 5%: -980.9098756457342
                                      VaR confidence interval:  (-531.493851094004, -458.8914517823496)
                                      CVaR confidence interval:  (-1065.1020329587323, -916.7715329049889)
                                      num failures: 168
                                      Kupiec test p-value:  1.5650390629644841e-28
                                      **********************************************************************************
                                      t-Student Distribution with degree = 13.000000
                                      Value at Risk(VaR) 5%: -491.03854047944935
                                      Conditional Value at Risk(CVaR) 5%: -935.7071648623097
                                      VaR confidence interval:  (-525.4675230110204, -467.58384451418425)
                                      CVaR confidence interval:  (-1017.2793144267849, -876.0975386408467)
                                      num failures: 176
                                      Kupiec test p-value:  2.8558136329173634e-32
                                      **********************************************************************************
                                      t-Student Distribution with degree = 13.500000
                                      Value at Risk(VaR) 5%: -475.965715024976
                                      Conditional Value at Risk(CVaR) 5%: -930.6529157496475
                                      VaR confidence interval:  (-510.0120257493305, -451.8964042982009)
                                      CVaR confidence interval:  (-1002.3701918899412, -848.1961079707763)
                                      num failures: 179
                                      Kupiec test p-value:  1.0155804463364489e-33
                                      **********************************************************************************
                                      t-Student Distribution with degree = 14.000000
                                      Value at Risk(VaR) 5%: -472.1174725835242
                                      Conditional Value at Risk(CVaR) 5%: -842.3676707145473
                                      VaR confidence interval:  (-494.97685061988153, -450.9208711476381)
                                      CVaR confidence interval:  (-888.260478350068, -781.360978917425)
                                      num failures: 179
                                      Kupiec test p-value:  1.0155804463364489e-33
                                      **********************************************************************************
                                      t-Student Distribution with degree = 14.500000
                                      Value at Risk(VaR) 5%: -499.00376226066226
                                      Conditional Value at Risk(CVaR) 5%: -931.7004559138841
                                      VaR confidence interval:  (-536.3984586437655, -462.6823363221624)
                                      CVaR confidence interval:  (-1009.3520002394661, -865.9378420570354)
                                      num failures: 169
                                      Kupiec test p-value:  5.461307698181073e-29
                                      **********************************************************************************
                                      t-Student Distribution with degree = 15.000000
                                      Value at Risk(VaR) 5%: -489.81269608199227
                                      Conditional Value at Risk(CVaR) 5%: -882.4477040020491
                                      VaR confidence interval:  (-527.8416404331858, -461.9098488833241)
                                      CVaR confidence interval:  (-944.955121012438, -816.47753036955)
                                      num failures: 176
                                      Kupiec test p-value:  2.8558136329173634e-32
                                      **********************************************************************************
                                      t-Student Distribution with degree = 15.500000
                                      Value at Risk(VaR) 5%: -492.71961915575463
                                      Conditional Value at Risk(CVaR) 5%: -874.9108097610904
                                      VaR confidence interval:  (-522.6070274987819, -467.66680478731786)
                                      CVaR confidence interval:  (-952.4578530152221, -829.6137608759864)
                                      num failures: 175
                                      Kupiec test p-value:  8.572435535937688e-32
                                      **********************************************************************************
                                      t-Student Distribution with degree = 16.000000
                                      Value at Risk(VaR) 5%: -489.8602346440471
                                      Conditional Value at Risk(CVaR) 5%: -902.0282251143717
                                      VaR confidence interval:  (-525.1523621398582, -463.4473753829734)
                                      CVaR confidence interval:  (-961.7793826368352, -836.2359508788925)
                                      num failures: 176
                                      Kupiec test p-value:  2.8558136329173634e-32
                                      **********************************************************************************
                                      
                                      We can see that with degrees 2.5, 3, 3.5,the model passed the test!
                                      Let's try more stuff.

                                      Gaussian Mixture Models (GMM)

                                      Let's try using GMM!
                                      In [188]:
                                      gmm=GaussianMixture(n_components=500,covariance_type='tied',max_iter=100)
                                      factorsReturns=np.array(factorsReturns)
                                      model=gmm.fit(factorsReturns.T)
                                      
                                      Let's plot the distributions.
                                      In [189]:
                                      def GMM_plotDistribution(samples,GMM_model,number_samples=10000):
                                          
                                          for i in range(len(samples)):
                                              vmin = min(samples[i])
                                              vmax = max(samples[i])
                                              stddev = np.std(samples[i])
                                      
                                              domain = np.arange(vmin, vmax, (vmax-vmin)/100)
                                      
                                              # a simple heuristic to select bandwidth
                                              bandwidth = 1.06 * stddev * pow(len(samples[i]), -.2)
                                      
                                              # estimate density
                                              kde = KDEUnivariate(samples[i])
                                              kde.fit(bw=bandwidth)
                                              density = kde.evaluate(domain)
                                      
                                              # plot
                                              plt.plot(domain, density,label="KDE")
                                              plt.title("{}".format(factor_names[i]))
                                      
                                              # Plot the GMM
                                              kde=KDEUnivariate(GMM_model.sample(number_samples)[0].T[i])
                                              kde.fit(bw=bandwidth)
                                              GMM=kde.evaluate(domain)        
                                              plt.plot(domain,GMM, linewidth=2,color='r',linestyle='--',label="GMM")
                                              plt.legend()
                                              plt.show()
                                      
                                      In [190]:
                                      GMM_plotDistribution(factorsReturns,model)
                                      
                                      We see that GMMs better fit the data.
                                      Let's use them to estimate the VaR.
                                      In [48]:
                                      #function that samples from a GMM and gives the returns of each instrument
                                      def GMM_simulateTrialReturns(data,num_comp,numTrials,weights):
                                          trialReturns = []
                                          gmm=GaussianMixture(n_components=num_comp,covariance_type='tied',max_iter=500)#covariance type is 'tied' because elements are correlated
                                          data=np.array(data)
                                          model=gmm.fit(data.T)
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = model.sample(1)[0].tolist()[0]
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize(list(trialFactorReturns))
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              trialFeatures=np.array(trialFeatures)
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = sum(np.dot(weights,trialFeatures).tolist())
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      In [192]:
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      for num in [4,10,16,50,100]:
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          GMM_simulateTrialReturns(factorsReturns,num,
                                                              max(int(numTrials/parallelism), 1), 
                                                              bFactorWeights.value
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("GMM with %d components" %num)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      GMM with 4 components
                                      Value at Risk(VaR) 5%: -454.0385060861455
                                      Conditional Value at Risk(CVaR) 5%: -791.1367732335666
                                      VaR confidence interval:  (-475.9520727810136, -417.002723921976)
                                      CVaR confidence interval:  (-832.4412611449976, -740.5264697453018)
                                      num failures: 185
                                      Kupiec test p-value:  1.0809653289335294e-36
                                      **********************************************************************************
                                      GMM with 10 components
                                      Value at Risk(VaR) 5%: -461.17028081013353
                                      Conditional Value at Risk(CVaR) 5%: -838.1926129688569
                                      VaR confidence interval:  (-483.29720731729094, -425.10504119672714)
                                      CVaR confidence interval:  (-896.9852516775013, -787.3375275158614)
                                      num failures: 182
                                      Kupiec test p-value:  3.4090184733129637e-35
                                      **********************************************************************************
                                      GMM with 16 components
                                      Value at Risk(VaR) 5%: -424.1419758098551
                                      Conditional Value at Risk(CVaR) 5%: -770.7473639157546
                                      VaR confidence interval:  (-447.29711707606094, -405.8981058585983)
                                      CVaR confidence interval:  (-824.2586431965652, -717.8233250069447)
                                      num failures: 200
                                      Kupiec test p-value:  1.512856662123067e-44
                                      **********************************************************************************
                                      GMM with 50 components
                                      Value at Risk(VaR) 5%: -429.8964359814336
                                      Conditional Value at Risk(CVaR) 5%: -924.0745432034413
                                      VaR confidence interval:  (-449.84700095977195, -396.27050317765594)
                                      CVaR confidence interval:  (-1035.8367823821611, -856.7269704614441)
                                      num failures: 197
                                      Kupiec test p-value:  6.2745969472336414e-43
                                      **********************************************************************************
                                      GMM with 100 components
                                      Value at Risk(VaR) 5%: -436.1196637022068
                                      Conditional Value at Risk(CVaR) 5%: -1013.83492923862
                                      VaR confidence interval:  (-469.3213731414521, -404.47673922587825)
                                      CVaR confidence interval:  (-1101.9052685509278, -932.7620261954147)
                                      num failures: 193
                                      Kupiec test p-value:  8.288910673192252e-41
                                      **********************************************************************************
                                      
                                      We can see that none of the models with GMM passed the test.

                                      Full Stocks

                                      We will now use the whole given stocks!
                                      In [69]:
                                      # select path of all stock data files in "stock_folder"
                                      files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
                                      
                                      # read each line in each file, convert it into the format: (date, value)
                                      rawStocks = [process_stock_file(f) for f in files]
                                      
                                      # select only instruments which have more than 5 years of history
                                      # Note: the number of business days in a year is 260
                                      number_of_years = 5
                                      stock_names_small=[stock_names[i] for i in range(len(files)) if(len(rawStocks[i])>260*5) ]
                                      rawStocks = list(filter(lambda instrument:len(instrument)>260*5 , rawStocks))
                                      # For testing, print the first 5 entry of the first stock
                                      print(rawStocks[0][:5])
                                      
                                      [(datetime.datetime(1997, 8, 14, 0, 0), 6.011436), (datetime.datetime(1997, 8, 15, 0, 0), 6.473854), (datetime.datetime(1997, 8, 18, 0, 0), 7.47576), (datetime.datetime(1997, 8, 19, 0, 0), 7.39869), (datetime.datetime(1997, 8, 20, 0, 0), 7.39869)]
                                      
                                      Let's do some time alignment.
                                      In [70]:
                                      # trim into a specific time region
                                      # and fill up the missing values
                                      stocks = list(map(lambda stock: \
                                                  fillInHistory(
                                                      trimToRegion(stock, start, end), 
                                                  start, end), 
                                              rawStocks))
                                      
                                      
                                      # test our code
                                      print("the first 5 records of stock 0:", stocks[0][:5], "\n")
                                      print("the last 5 records of stock 0:", stocks[0][-5:], "\n")
                                      print("the first 5 records of factor 0:", factors[0][:5], "\n")
                                      print("the first 5 records of factor 0:", factors[0][-5:], "\n")
                                      
                                      the first 5 records of stock 0: [(datetime.datetime(2009, 1, 23, 0, 0), 16.254108), (datetime.datetime(2009, 1, 26, 0, 0), 16.470275), (datetime.datetime(2009, 1, 27, 0, 0), 16.703071), (datetime.datetime(2009, 1, 28, 0, 0), 17.975132), (datetime.datetime(2009, 1, 29, 0, 0), 16.478589)] 
                                      
                                      the last 5 records of stock 0: [(datetime.datetime(2014, 1, 16, 0, 0), 35.571935), (datetime.datetime(2014, 1, 17, 0, 0), 35.552912), (datetime.datetime(2014, 1, 20, 0, 0), 35.552912), (datetime.datetime(2014, 1, 21, 0, 0), 35.971404), (datetime.datetime(2014, 1, 22, 0, 0), 36.019202)] 
                                      
                                      the first 5 records of factor 0: [(datetime.datetime(2009, 1, 23, 0, 0), 46.47), (datetime.datetime(2009, 1, 26, 0, 0), 45.73), (datetime.datetime(2009, 1, 27, 0, 0), 41.58), (datetime.datetime(2009, 1, 28, 0, 0), 42.16), (datetime.datetime(2009, 1, 29, 0, 0), 41.44)] 
                                      
                                      the first 5 records of factor 0: [(datetime.datetime(2014, 1, 16, 0, 0), 93.96), (datetime.datetime(2014, 1, 17, 0, 0), 94.37), (datetime.datetime(2014, 1, 20, 0, 0), 93.93), (datetime.datetime(2014, 1, 21, 0, 0), 94.99), (datetime.datetime(2014, 1, 22, 0, 0), 96.73)] 
                                      
                                      
                                      Let's estimate the weights.
                                      In [71]:
                                      stocksReturns = list(map(twoWeekReturns, stocks))
                                      # transpose factorsReturns
                                      factorMat = transpose(factorsReturns)
                                      
                                      # featurize each row of factorMat
                                      factorFeatures = list(map(featurize,factorMat))
                                      
                                      # OLS require parameter is a numpy array
                                      factor_columns = np.array(factorFeatures)
                                      
                                      #add a constant - the intercept term for each instrument i.
                                      factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      #print(factor_columns[0])
                                      # estimate weights
                                      weights = [estimateParams(stockReturns,factor_columns) for stockReturns in stocksReturns]
                                      
                                      #print("weights:", weights)
                                      print("size of weights= ",np.array(weights).shape)
                                      print("size of stockReturns= ",np.array(stocksReturns).shape)
                                      print("size of factor_columns= ",np.array(factor_columns).shape)
                                      
                                      size of weights=  (2059, 13)
                                      size of stockReturns=  (2059, 1295)
                                      size of factor_columns=  (1295, 13)
                                      
                                      Let's do the calculations.

                                      Normal Distribution

                                      In [72]:
                                      factorCov = np.cov(factorsReturns)
                                      factorMeans = [sum(w)/len(w) for w in factorsReturns]
                                      
                                      In [197]:
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      trials = seedRDD.flatMap(lambda idx: \
                                                      simulateTrialReturns(
                                                          max(int(numTrials/parallelism), 1), 
                                                          factorMeans, factorCov,
                                                          bFactorWeights.value
                                                      ))
                                      trials.cache()
                                      
                                      valueAtRisk = fivePercentVaR(trials)
                                      conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                      print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      
                                      Value at Risk(VaR) 5%: -728319.8751674257
                                      Conditional Value at Risk(CVaR) 5%: -979381.741674978
                                      
                                      In [198]:
                                      varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                      cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                      print("VaR confidence interval: " , varConfidenceInterval)
                                      print("CVaR confidence interval: " , cvarConfidenceInterval)
                                      print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
                                      
                                      VaR confidence interval:  (-750607.6086090583, -699770.1825305864)
                                      CVaR confidence interval:  (-1013963.9732510811, -944629.6108551241)
                                      num failures: 110
                                      Kupiec test p-value:  1.3638834244574133e-07
                                      
                                      We observe that with a much higher number of stocks the result of the Kupiec test was better but still not good enough.

                                      Gaussian Mixture Models

                                      In [199]:
                                      gmm=GaussianMixture(n_components=500,covariance_type='tied',max_iter=100)
                                      factorsReturns=np.array(factorsReturns)
                                      model=gmm.fit(factorsReturns.T)
                                      
                                      In [200]:
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      test_result=[]
                                      for num in [4,10,16,50,100]:
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          GMM_simulateTrialReturns(factorsReturns,num,
                                                              max(int(numTrials/parallelism), 1), 
                                                              bFactorWeights.value
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("GMM with %d components" %num)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          test_result.append(result)
                                          print("**********************************************************************************")
                                      
                                      GMM with 4 components
                                      Value at Risk(VaR) 5%: -747491.6379914016
                                      Conditional Value at Risk(CVaR) 5%: -1042498.5639916708
                                      VaR confidence interval:  (-780769.6361979212, -716912.8016543844)
                                      CVaR confidence interval:  (-1089987.4200476117, -996645.9181555867)
                                      num failures: 99
                                      Kupiec test p-value:  4.7872584616496135e-05
                                      **********************************************************************************
                                      GMM with 10 components
                                      Value at Risk(VaR) 5%: -752402.7474328183
                                      Conditional Value at Risk(CVaR) 5%: -1068744.35515204
                                      VaR confidence interval:  (-770380.9936180705, -715749.4282782662)
                                      CVaR confidence interval:  (-1116796.6843051931, -1030655.1552759467)
                                      num failures: 99
                                      Kupiec test p-value:  4.7872584616496135e-05
                                      **********************************************************************************
                                      GMM with 16 components
                                      Value at Risk(VaR) 5%: -731361.0490930863
                                      Conditional Value at Risk(CVaR) 5%: -1077119.2861471928
                                      VaR confidence interval:  (-763662.119199597, -705463.0534650929)
                                      CVaR confidence interval:  (-1126478.0837678267, -1039511.0162535092)
                                      num failures: 109
                                      Kupiec test p-value:  2.440374037805949e-07
                                      **********************************************************************************
                                      GMM with 50 components
                                      Value at Risk(VaR) 5%: -739421.5699950433
                                      Conditional Value at Risk(CVaR) 5%: -1062312.2232458882
                                      VaR confidence interval:  (-770494.991528184, -709052.2963199407)
                                      CVaR confidence interval:  (-1109966.7485721787, -1026345.5955697779)
                                      num failures: 99
                                      Kupiec test p-value:  4.7872584616496135e-05
                                      **********************************************************************************
                                      GMM with 100 components
                                      Value at Risk(VaR) 5%: -737659.7036595237
                                      Conditional Value at Risk(CVaR) 5%: -1046166.2087551663
                                      VaR confidence interval:  (-765007.936714765, -710105.4750398104)
                                      CVaR confidence interval:  (-1082590.6786667793, -1010814.2987938297)
                                      num failures: 99
                                      Kupiec test p-value:  4.7872584616496135e-05
                                      **********************************************************************************
                                      
                                      We observe that with a higher number of stocks the result of the Kupiec test with GMM was better but still not good enough.

                                      t-Student Distribution

                                      In [205]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      prev_result,count=0,0
                                      for i in (np.arange(30)/2+1.5):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          if(result==prev_result):  #if the result is equal to the result of the previous degree,then increment count
                                              count+=1
                                          else:
                                              count=0   #put count =0
                                              prev_result=result #store the current result
                                          if(count==2): #if last 3 results were the same, then break
                                              break
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 1.500000
                                      Value at Risk(VaR) 5%: -2685953.2186787557
                                      Conditional Value at Risk(CVaR) 5%: -5058840919.284274
                                      VaR confidence interval:  (-3149064.6084377086, -2307596.7070627217)
                                      CVaR confidence interval:  (-18566454628.219543, -90207357.92843434)
                                      num failures: 22
                                      Kupiec test p-value:  3.329689547664531e-10
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.000000
                                      Value at Risk(VaR) 5%: -1628334.5398382754
                                      Conditional Value at Risk(CVaR) 5%: -9440144.66338705
                                      VaR confidence interval:  (-1762418.3622817243, -1482501.6923628438)
                                      CVaR confidence interval:  (-11395682.897416625, -7469994.236089755)
                                      num failures: 42
                                      Kupiec test p-value:  0.001991292838160743
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.500000
                                      Value at Risk(VaR) 5%: -1286144.7287855821
                                      Conditional Value at Risk(CVaR) 5%: -7968241.498106383
                                      VaR confidence interval:  (-1385563.5246726705, -1217608.3564494296)
                                      CVaR confidence interval:  (-14308536.169401087, -3786893.197162104)
                                      num failures: 56
                                      Kupiec test p-value:  0.2539027201840111
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.000000
                                      Value at Risk(VaR) 5%: -1140385.1493384829
                                      Conditional Value at Risk(CVaR) 5%: -4163164.4059714833
                                      VaR confidence interval:  (-1236438.575860597, -1088085.785796404)
                                      CVaR confidence interval:  (-6181076.586763669, -3087804.1168717886)
                                      num failures: 63
                                      Kupiec test p-value:  0.8226854051722098
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.500000
                                      Value at Risk(VaR) 5%: -1061711.4841938734
                                      Conditional Value at Risk(CVaR) 5%: -2345143.2300981665
                                      VaR confidence interval:  (-1117125.0373302684, -1008530.7326959727)
                                      CVaR confidence interval:  (-2681695.5567368977, -2048432.2307862272)
                                      num failures: 70
                                      Kupiec test p-value:  0.5085489050446723
                                      **********************************************************************************
                                      t-Student Distribution with degree = 4.000000
                                      Value at Risk(VaR) 5%: -1048125.6718739638
                                      Conditional Value at Risk(CVaR) 5%: -2199184.209274472
                                      VaR confidence interval:  (-1086194.5637488402, -981718.3693999872)
                                      CVaR confidence interval:  (-2500706.9556317558, -1894518.719171686)
                                      num failures: 70
                                      Kupiec test p-value:  0.5085489050446723
                                      **********************************************************************************
                                      t-Student Distribution with degree = 4.500000
                                      Value at Risk(VaR) 5%: -958247.3572734002
                                      Conditional Value at Risk(CVaR) 5%: -1812121.4209447796
                                      VaR confidence interval:  (-984937.5263099344, -915976.4796372048)
                                      CVaR confidence interval:  (-2051011.5964786555, -1625178.2318140755)
                                      num failures: 81
                                      Kupiec test p-value:  0.04577388317787937
                                      **********************************************************************************
                                      t-Student Distribution with degree = 5.000000
                                      Value at Risk(VaR) 5%: -903547.6942041294
                                      Conditional Value at Risk(CVaR) 5%: -1596262.173163424
                                      VaR confidence interval:  (-954572.7546838158, -859454.3979329692)
                                      CVaR confidence interval:  (-1728957.719376945, -1475958.369723101)
                                      num failures: 86
                                      Kupiec test p-value:  0.009722099905175786
                                      **********************************************************************************
                                      t-Student Distribution with degree = 5.500000
                                      Value at Risk(VaR) 5%: -911005.0900465912
                                      Conditional Value at Risk(CVaR) 5%: -1771205.8704705548
                                      VaR confidence interval:  (-954394.1455015664, -874516.1054016564)
                                      CVaR confidence interval:  (-2451904.916877189, -1444226.914320677)
                                      num failures: 84
                                      Kupiec test p-value:  0.018689333857969618
                                      **********************************************************************************
                                      t-Student Distribution with degree = 6.000000
                                      Value at Risk(VaR) 5%: -875426.9661313522
                                      Conditional Value at Risk(CVaR) 5%: -1349160.4735452405
                                      VaR confidence interval:  (-914873.3442912848, -840550.5064734475)
                                      CVaR confidence interval:  (-1458469.4121922806, -1283806.2955146418)
                                      num failures: 88
                                      Kupiec test p-value:  0.004836626495626101
                                      **********************************************************************************
                                      t-Student Distribution with degree = 6.500000
                                      Value at Risk(VaR) 5%: -858856.2536283074
                                      Conditional Value at Risk(CVaR) 5%: -1397953.8039567852
                                      VaR confidence interval:  (-901870.2996487828, -839146.4087061066)
                                      CVaR confidence interval:  (-1497657.186161434, -1294143.1703240422)
                                      num failures: 88
                                      
                                      We can see that the model gets good values when the degree is between 2.5 and 3.5.
                                      Let's try values in this range to see if we can get better results.
                                      In [206]:
                                      prev_result,count=0,0 #values used to stop iterating when a certain result is repeated several times
                                      
                                      for i in ([2.6,2.7,2.8,2.9,3.1,3.2,3.3]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          if(result==prev_result):  #if the result is equal to the result of the previous degree,then increment count
                                              count+=1
                                          else:
                                              count=0   #put count=0
                                              prev_result=result  #store current result
                                          if(count==2): #if last 3 results were the same, then break
                                              break
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 2.600000
                                      Value at Risk(VaR) 5%: -1255163.2550978207
                                      Conditional Value at Risk(CVaR) 5%: -7063828.20397346
                                      VaR confidence interval:  (-1361622.3028134652, -1192991.8016141183)
                                      CVaR confidence interval:  (-11968848.363086285, -4167589.673765932)
                                      num failures: 56
                                      Kupiec test p-value:  0.2539027201840111
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.700000
                                      Value at Risk(VaR) 5%: -1153835.6411847011
                                      Conditional Value at Risk(CVaR) 5%: -3798095.7062255917
                                      VaR confidence interval:  (-1212078.8017489705, -1075380.310235502)
                                      CVaR confidence interval:  (-4475761.070231479, -2979430.0864824927)
                                      num failures: 63
                                      Kupiec test p-value:  0.8226854051722098
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.800000
                                      Value at Risk(VaR) 5%: -1241814.5715813513
                                      Conditional Value at Risk(CVaR) 5%: -3482318.7447110857
                                      VaR confidence interval:  (-1331703.0359929914, -1160769.7700885846)
                                      CVaR confidence interval:  (-4075548.6520813853, -3023121.0576557927)
                                      num failures: 58
                                      Kupiec test p-value:  0.3813182755356782
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.900000
                                      Value at Risk(VaR) 5%: -1139310.0242110428
                                      Conditional Value at Risk(CVaR) 5%: -4829410.878152002
                                      VaR confidence interval:  (-1206542.5243878, -1077635.1130832455)
                                      CVaR confidence interval:  (-8011043.029267564, -3107245.933796357)
                                      num failures: 63
                                      Kupiec test p-value:  0.8226854051722098
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.100000
                                      Value at Risk(VaR) 5%: -1064596.7705663966
                                      Conditional Value at Risk(CVaR) 5%: -3312910.995314726
                                      VaR confidence interval:  (-1143233.3902890638, -1015326.1759208881)
                                      CVaR confidence interval:  (-4189631.0655039595, -2704022.1059850203)
                                      num failures: 69
                                      Kupiec test p-value:  0.5916686667593931
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.200000
                                      Value at Risk(VaR) 5%: -1107737.214542686
                                      Conditional Value at Risk(CVaR) 5%: -3552907.0221760166
                                      VaR confidence interval:  (-1163403.534162734, -1061935.1281334017)
                                      CVaR confidence interval:  (-5693633.88281001, -2253330.7142919204)
                                      num failures: 66
                                      Kupiec test p-value:  0.8737507388380307
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.300000
                                      Value at Risk(VaR) 5%: -1076460.3020410403
                                      Conditional Value at Risk(CVaR) 5%: -2963628.893261669
                                      VaR confidence interval:  (-1139630.4636505498, -1030547.1668933205)
                                      CVaR confidence interval:  (-3726521.343293376, -2360270.988622486)
                                      num failures: 66
                                      Kupiec test p-value:  0.8737507388380307
                                      **********************************************************************************
                                      
                                      Let's try even more values! This time between 3.1 and 3.2
                                      In [208]:
                                      prev_result,count=0,0
                                      for i in ([3.11,3.13,3.15,3.17,3.19]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          if(result==prev_result):  #if the result is equal to the result of the previous degree,then increment count
                                              count+=1
                                          else:
                                              count=0   #put count=0
                                              prev_result=result   #store previous result
                                          if(count==2): #if last 3 results were the same, then break
                                              break
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 3.110000
                                      Value at Risk(VaR) 5%: -1127690.9993057156
                                      Conditional Value at Risk(CVaR) 5%: -2961909.2393828337
                                      VaR confidence interval:  (-1205523.486371217, -1047573.228761778)
                                      CVaR confidence interval:  (-3995420.9175926694, -2335097.919342885)
                                      num failures: 66
                                      Kupiec test p-value:  0.8737507388380307
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.130000
                                      Value at Risk(VaR) 5%: -1130405.455889774
                                      Conditional Value at Risk(CVaR) 5%: -3384488.3734709173
                                      VaR confidence interval:  (-1186739.001808746, -1073004.744910484)
                                      CVaR confidence interval:  (-4505571.891186147, -2812542.7332916474)
                                      num failures: 65
                                      Kupiec test p-value:  0.9745867316828987
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.150000
                                      Value at Risk(VaR) 5%: -1115602.0287505214
                                      Conditional Value at Risk(CVaR) 5%: -2712368.714924409
                                      VaR confidence interval:  (-1168691.5791491852, -1061877.7767577774)
                                      CVaR confidence interval:  (-3277187.835853535, -2358551.7442211844)
                                      num failures: 66
                                      Kupiec test p-value:  0.8737507388380307
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.170000
                                      Value at Risk(VaR) 5%: -1119690.5466896703
                                      Conditional Value at Risk(CVaR) 5%: -3184855.461761877
                                      VaR confidence interval:  (-1166786.2954655157, -1055841.90671653)
                                      CVaR confidence interval:  (-4270345.456869589, -2517346.6412077458)
                                      num failures: 66
                                      Kupiec test p-value:  0.8737507388380307
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.190000
                                      Value at Risk(VaR) 5%: -1056523.7015881715
                                      Conditional Value at Risk(CVaR) 5%: -3115025.6938437754
                                      VaR confidence interval:  (-1105099.8399632135, -1015533.5267579453)
                                      CVaR confidence interval:  (-4706132.767105792, -2012351.8059234146)
                                      num failures: 70
                                      Kupiec test p-value:  0.5085489050446723
                                      **********************************************************************************
                                      
                                      We can see that with a degree of freedom of 3.13,we get a Kupiec test p-value of 0.97.

                                      Let's add more factors!

                                      We will add the following factors:
                                      • The EUR_USD Conversion rate (investing.com)
                                      • GOLD (investing.com)
                                      • Silver (investing.com)
                                      • Natural Gas (investing.com)
                                      • N225 (yahoo.com)
                                      • RUT (yahoo.com)
                                      • TNX (yahoo.com)
                                      • VIX (yahoo.com)
                                      The data is already in the requested time range.
                                      We need to prepocess the data again. The files have differenct separators and date formats,so we will tackle each group separately.
                                      In [73]:
                                      base_folder = "monte-carlo-risk/"
                                      
                                      factors_folder= base_folder + "factors/"
                                      
                                      # read data from local disk
                                      #the data we have is split by commas
                                      def readInvestingDotComHistory(fname):
                                          def process_line(line):
                                              cols = line.split(",")
                                              #print(cols)
                                              d=cols[0]+cols[1] #in our file the date is splitted because there is a comma inside it, we concatenate the string here
                                              d.replace('"','')
                                              #print(d)
                                              date = datetime.strptime(d, '"%b %d %Y"')
                                              value = float(cols[2].replace('"',''))
                                              return (date, value)
                                          
                                          with open(fname) as f:
                                              content_w_header = f.readlines()
                                              # remove the first line 
                                              # and reverse lines to sort the data by date, in ascending order
                                              content = content_w_header[:0:-1]
                                              return list(map(process_line , content))
                                      
                                      factor3_files = ['EUR_USD.csv', 'Gold.csv','Natural_Gas.csv','Silver.csv']
                                      factor3_files = map(lambda fn: factors_folder + fn, factor3_files)
                                      factors3 = [readInvestingDotComHistory(f) for f in factor3_files]
                                      
                                      #testing
                                      print(factors3[0][:5])
                                      
                                      [(datetime.datetime(2009, 1, 23, 0, 0), 1.2992), (datetime.datetime(2009, 1, 26, 0, 0), 1.3166), (datetime.datetime(2009, 1, 27, 0, 0), 1.3178), (datetime.datetime(2009, 1, 28, 0, 0), 1.3138), (datetime.datetime(2009, 1, 29, 0, 0), 1.2959)]
                                      
                                      In [74]:
                                      # read data from local disk
                                      def readYahooHistory(fname):
                                          def process_line(line):
                                              cols=line.split(",")
                                              #print(cols)
                                              if('null' not in cols): #we have 'null' values in the file
                                                  date=datetime.strptime(cols[0],"%Y-%m-%d")
                                                  #print(cols)
                                                  value=float(cols[5])
                                                  return(date,value)
                                          
                                          with open(fname) as f:
                                              lines_w_header=f.readlines()
                                              lines=lines_w_header[1:]
                                              return(list(map(process_line,lines)))
                                              #...
                                          
                                      
                                      factor4_files = ["N225.csv"]
                                      factor4_files = map(lambda fn: factors_folder + fn, factor4_files)
                                      
                                      factors4 = [readYahooHistory(f) for f in factor4_files]
                                      factors4[0].remove(None) #there are nome Nonetype object and we need to remove them
                                      #testing
                                      print(factors4[0][:5])
                                      print("\n")
                                      print(factors4[-1][-5:])
                                      
                                      [(datetime.datetime(2009, 1, 23, 0, 0), 7745.25), (datetime.datetime(2009, 1, 26, 0, 0), 7682.140137), (datetime.datetime(2009, 1, 27, 0, 0), 8061.069824), (datetime.datetime(2009, 1, 28, 0, 0), 8106.290039), (datetime.datetime(2009, 1, 29, 0, 0), 8251.240234)]
                                      
                                      
                                      [(datetime.datetime(2014, 1, 20, 0, 0), 15641.679688), (datetime.datetime(2014, 1, 21, 0, 0), 15795.959961), (datetime.datetime(2014, 1, 22, 0, 0), 15820.959961), (datetime.datetime(2014, 1, 23, 0, 0), 15695.889648), (datetime.datetime(2014, 1, 24, 0, 0), 15391.55957)]
                                      
                                      In [75]:
                                      # read data from local disk
                                      def readYahooHistory(fname):
                                          def process_line(line):
                                              cols=line.split(",")
                                              #print(cols)
                                              if('null' not in cols): #we have 'null' values in the file
                                                  date=datetime.strptime(cols[0],"%m/%d/%Y")
                                                  #print(cols)
                                                  value=float(cols[5])
                                                  return(date,value)
                                          
                                          with open(fname) as f:
                                              lines_w_header=f.readlines()
                                              lines=lines_w_header[1:]
                                              return(list(map(process_line,lines)))
                                              #...
                                          
                                      
                                      factor5_files = ["RUT.csv","TNX.csv","VIX.csv"]
                                      factor5_files = map(lambda fn: factors_folder + fn, factor5_files)
                                      
                                      factors5 = [readYahooHistory(f) for f in factor5_files]
                                      
                                      #testing for Nonetype objects and removing them
                                      for i in range(len(factors5)):
                                          if(None in factors5[i]):
                                              factors5[i].remove(None)
                                      
                                      #testing
                                      print(factors5[0][:5])
                                      print("\n")
                                      print(factors5[1][:5])
                                      print("\n")
                                      print(factors5[2][:5])
                                      
                                      [(datetime.datetime(2009, 1, 23, 0, 0), 444.359985), (datetime.datetime(2009, 1, 26, 0, 0), 450.059998), (datetime.datetime(2009, 1, 27, 0, 0), 455.579987), (datetime.datetime(2009, 1, 28, 0, 0), 473.019989), (datetime.datetime(2009, 1, 29, 0, 0), 453.23999)]
                                      
                                      
                                      [(datetime.datetime(2009, 1, 23, 0, 0), 2.622), (datetime.datetime(2009, 1, 26, 0, 0), 2.643), (datetime.datetime(2009, 1, 27, 0, 0), 2.519), (datetime.datetime(2009, 1, 28, 0, 0), 2.656), (datetime.datetime(2009, 1, 29, 0, 0), 2.815)]
                                      
                                      
                                      [(datetime.datetime(2009, 1, 23, 0, 0), 47.27), (datetime.datetime(2009, 1, 26, 0, 0), 45.689999), (datetime.datetime(2009, 1, 27, 0, 0), 42.25), (datetime.datetime(2009, 1, 28, 0, 0), 39.66), (datetime.datetime(2009, 1, 29, 0, 0), 42.630001)]
                                      
                                      Time alignement of data
                                      In [76]:
                                      allfactors = factors1+factors2+factors3+factors4+factors5
                                      factors = list(map( lambda factor: \
                                                  fillInHistory(
                                                      trimToRegion(factor, start,end),
                                                  start,end), 
                                                  allfactors
                                                  ))
                                      
                                      #testing
                                      print("the first 5 records of factor 9:", factors[9][:5], "\n")
                                      print("the first 5 records of factor 8:", factors[8][-5:], "\n")
                                      
                                      the first 5 records of factor 9: [(datetime.datetime(2009, 1, 23, 0, 0), 444.359985), (datetime.datetime(2009, 1, 26, 0, 0), 450.059998), (datetime.datetime(2009, 1, 27, 0, 0), 455.579987), (datetime.datetime(2009, 1, 28, 0, 0), 473.019989), (datetime.datetime(2009, 1, 29, 0, 0), 453.23999)] 
                                      
                                      the first 5 records of factor 8: [(datetime.datetime(2014, 1, 16, 0, 0), 15747.200195), (datetime.datetime(2014, 1, 17, 0, 0), 15734.459961), (datetime.datetime(2014, 1, 20, 0, 0), 15641.679688), (datetime.datetime(2014, 1, 21, 0, 0), 15795.959961), (datetime.datetime(2014, 1, 22, 0, 0), 15820.959961)] 
                                      
                                      
                                      In [77]:
                                      #getting two weeks returns
                                      factorsReturns = list(map(twoWeekReturns, factors))
                                      
                                      In [78]:
                                      factor_names=["Crude Oil","Treasury Bond","GSPC","IXIC","EUR_USD","Gold","Natural Gas","Silver","N225","RUT","TNX","VIX"]
                                      
                                      Let's do the calculations.
                                      In [79]:
                                      # transpose factorsReturns
                                      factorMat = transpose(factorsReturns)
                                      
                                      # featurize each row of factorMat
                                      factorFeatures = list(map(featurize,factorMat))
                                      
                                      # OLS require parameter is a numpy array
                                      factor_columns = np.array(factorFeatures)
                                      
                                      #add a constant - the intercept term for each instrument i.
                                      factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      #print(factor_columns[0])
                                      # estimate weightsdd
                                      weights = [estimateParams(stockReturns,factor_columns) for stockReturns in stocksReturns]
                                      
                                      #print("weights:", weights)
                                      print("size of weights= ",np.array(weights).shape)
                                      print("size of stockReturns= ",np.array(stocksReturns).shape)
                                      print("size of factor_columns= ",np.array(factor_columns).shape)
                                      
                                      size of weights=  (2059, 37)
                                      size of stockReturns=  (2059, 1295)
                                      size of factor_columns=  (1295, 37)
                                      
                                      In [80]:
                                      correlation = np.corrcoef(factorsReturns)
                                      
                                      In [223]:
                                      cmap = sns.diverging_palette(220, 10,as_cmap=True)
                                      mask = np.zeros_like(correlation, dtype=np.bool)
                                      mask[np.triu_indices_from(mask)] = True
                                      # Draw the heatmap with the mask and correct aspect ratio
                                      plt.figure(figsize=(20, 20))
                                      sns.heatmap(correlation,vmin=0.2,vmax=0.95,cmap=cmap, center=0,annot=True,
                                                  square=True, linewidths=.5, cbar_kws={"shrink": .5},xticklabels=factor_names,yticklabels=factor_names)
                                      plt.xticks(rotation=0)
                                      plt.yticks(rotation=0)
                                      plt.title("Correlations between Market Factors")
                                      plt.show()
                                      
                                      We see that we get some high and some low correlations. Still we will use multivaruate distributions.
                                      In [81]:
                                      factorCov = np.cov(factorsReturns)
                                      factorMeans = [sum(w)/len(w) for w in factorsReturns]
                                      

                                      Normal Distribution

                                      In [224]:
                                      ### RUN SILMULATION
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      trials = seedRDD.flatMap(lambda idx: \
                                                      simulateTrialReturns(
                                                          max(int(numTrials/parallelism), 1), 
                                                          factorMeans, factorCov,
                                                          bFactorWeights.value
                                                      ))
                                      trials.cache()
                                      
                                      valueAtRisk = fivePercentVaR(trials)
                                      conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                      print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      
                                      Value at Risk(VaR) 5%: -1488337.4434537566
                                      Conditional Value at Risk(CVaR) 5%: -1832658.6628498153
                                      
                                      In [225]:
                                      varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                      cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                      print("VaR confidence interval: " , varConfidenceInterval)
                                      print("CVaR confidence interval: " , cvarConfidenceInterval)
                                      print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
                                      
                                      VaR confidence interval:  (-1533827.516789591, -1445890.9292617813)
                                      CVaR confidence interval:  (-1878601.9606241966, -1778149.0152779056)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      
                                      We see that with more stocks and more market factors the result is getting better but still not enough.

                                      t-Student Distribution

                                      In [64]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in ([2,2.5,3,3.13,4,5]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 2.000000
                                      Value at Risk(VaR) 5%: -2804654.4627770195
                                      Conditional Value at Risk(CVaR) 5%: -20460615.883276355
                                      VaR confidence interval:  (-2986975.955926591, -2634209.1677215863)
                                      CVaR confidence interval:  (-25977833.76028195, -15156878.397091812)
                                      num failures: 21
                                      Kupiec test p-value:  1.039752130181726e-10
                                      **********************************************************************************
                                      t-Student Distribution with degree = 2.500000
                                      Value at Risk(VaR) 5%: -2335934.9297755715
                                      Conditional Value at Risk(CVaR) 5%: -8583236.603263449
                                      VaR confidence interval:  (-2452710.1866286276, -2243083.4522132217)
                                      CVaR confidence interval:  (-10861332.001930404, -6619541.062647096)
                                      num failures: 25
                                      Kupiec test p-value:  8.367233845438816e-09
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.000000
                                      Value at Risk(VaR) 5%: -2042903.3302096953
                                      Conditional Value at Risk(CVaR) 5%: -4321464.767939782
                                      VaR confidence interval:  (-2139524.2960873735, -1966311.3930434901)
                                      CVaR confidence interval:  (-4836117.829275416, -3852648.0417207126)
                                      num failures: 31
                                      Kupiec test p-value:  1.8434639425521288e-06
                                      **********************************************************************************
                                      t-Student Distribution with degree = 3.130000
                                      Value at Risk(VaR) 5%: -1960462.9459432047
                                      Conditional Value at Risk(CVaR) 5%: -4707846.944440918
                                      VaR confidence interval:  (-2046867.717492859, -1889213.1898982578)
                                      CVaR confidence interval:  (-5539412.884029886, -3880699.626208414)
                                      num failures: 31
                                      Kupiec test p-value:  1.8434639425521288e-06
                                      **********************************************************************************
                                      t-Student Distribution with degree = 4.000000
                                      Value at Risk(VaR) 5%: -1903689.66745569
                                      Conditional Value at Risk(CVaR) 5%: -3366153.222897681
                                      VaR confidence interval:  (-1980250.5705120123, -1822181.3902581981)
                                      CVaR confidence interval:  (-3675722.760016788, -3041830.395397834)
                                      num failures: 31
                                      Kupiec test p-value:  1.8434639425521288e-06
                                      **********************************************************************************
                                      t-Student Distribution with degree = 5.000000
                                      Value at Risk(VaR) 5%: -1778310.1997009763
                                      Conditional Value at Risk(CVaR) 5%: -2575423.5515607055
                                      VaR confidence interval:  (-1835007.550274154, -1723411.9038188758)
                                      CVaR confidence interval:  (-2700413.976601015, -2439103.764471156)
                                      num failures: 33
                                      Kupiec test p-value:  8.478944638234279e-06
                                      **********************************************************************************
                                      
                                      Let's try some more values.
                                      In [65]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in ([7,9]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 7.000000
                                      Value at Risk(VaR) 5%: -1671961.584079518
                                      Conditional Value at Risk(CVaR) 5%: -2292354.0274360124
                                      VaR confidence interval:  (-1747927.1870239286, -1613899.4029196298)
                                      CVaR confidence interval:  (-2406503.6810444035, -2211865.590453775)
                                      num failures: 39
                                      Kupiec test p-value:  0.00040882445394070145
                                      **********************************************************************************
                                      t-Student Distribution with degree = 9.000000
                                      Value at Risk(VaR) 5%: -1569295.1018717387
                                      Conditional Value at Risk(CVaR) 5%: -2082908.4952525848
                                      VaR confidence interval:  (-1632273.9590942021, -1511714.4705332366)
                                      CVaR confidence interval:  (-2169418.229452313, -2002181.233983589)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      
                                      Let's try also some more values.
                                      In [66]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in ([10,12,14,16]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 10.000000
                                      Value at Risk(VaR) 5%: -1583394.6197982419
                                      Conditional Value at Risk(CVaR) 5%: -2094776.7412810933
                                      VaR confidence interval:  (-1636051.7848420748, -1522268.5713133977)
                                      CVaR confidence interval:  (-2189119.3117517484, -2033302.442599455)
                                      num failures: 43
                                      Kupiec test p-value:  0.003217931994360169
                                      **********************************************************************************
                                      t-Student Distribution with degree = 12.000000
                                      Value at Risk(VaR) 5%: -1571307.1888106265
                                      Conditional Value at Risk(CVaR) 5%: -2003264.6660045895
                                      VaR confidence interval:  (-1612097.4047732921, -1515473.9787900741)
                                      CVaR confidence interval:  (-2052100.7441320622, -1939622.121564347)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      t-Student Distribution with degree = 14.000000
                                      Value at Risk(VaR) 5%: -1598451.178809904
                                      Conditional Value at Risk(CVaR) 5%: -2038004.7206620437
                                      VaR confidence interval:  (-1644933.4706408205, -1555008.05165683)
                                      CVaR confidence interval:  (-2113589.391803851, -1983957.8524277448)
                                      num failures: 42
                                      Kupiec test p-value:  0.001991292838160743
                                      **********************************************************************************
                                      t-Student Distribution with degree = 16.000000
                                      Value at Risk(VaR) 5%: -1544904.672709585
                                      Conditional Value at Risk(CVaR) 5%: -1995902.453980746
                                      VaR confidence interval:  (-1591680.9116474811, -1495426.4886671456)
                                      CVaR confidence interval:  (-2046512.9961946572, -1942462.4577002407)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      
                                      Maybe it's better with higher degrees.
                                      In [67]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in ([20,25,30]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 20.000000
                                      Value at Risk(VaR) 5%: -1517621.2343712696
                                      Conditional Value at Risk(CVaR) 5%: -1949742.9831535066
                                      VaR confidence interval:  (-1570987.219087766, -1475300.0647058266)
                                      CVaR confidence interval:  (-2008808.5492850863, -1897810.4742964334)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      t-Student Distribution with degree = 25.000000
                                      Value at Risk(VaR) 5%: -1510495.497529316
                                      Conditional Value at Risk(CVaR) 5%: -1906528.608164858
                                      VaR confidence interval:  (-1559483.8920631534, -1468330.491833573)
                                      CVaR confidence interval:  (-1951100.6377262603, -1853782.035854154)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      t-Student Distribution with degree = 30.000000
                                      Value at Risk(VaR) 5%: -1515939.6375867228
                                      Conditional Value at Risk(CVaR) 5%: -1870610.8853075088
                                      VaR confidence interval:  (-1569736.9927143166, -1458883.5864716019)
                                      CVaR confidence interval:  (-1920293.7457548818, -1826775.4824128314)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      
                                      Let's narrow the range.
                                      In [68]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in ([13,13.3,13.5,13.7]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 13.000000
                                      Value at Risk(VaR) 5%: -1553139.1986955793
                                      Conditional Value at Risk(CVaR) 5%: -1974298.5560178768
                                      VaR confidence interval:  (-1602781.5924627043, -1507658.493550339)
                                      CVaR confidence interval:  (-2030297.9578518607, -1918272.3126321705)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      t-Student Distribution with degree = 13.300000
                                      Value at Risk(VaR) 5%: -1546747.4705161834
                                      Conditional Value at Risk(CVaR) 5%: -1987457.7780322167
                                      VaR confidence interval:  (-1599406.141826442, -1518088.055727921)
                                      CVaR confidence interval:  (-2052962.6560567159, -1930915.4165076388)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      t-Student Distribution with degree = 13.500000
                                      Value at Risk(VaR) 5%: -1579787.955565128
                                      Conditional Value at Risk(CVaR) 5%: -2023415.3921686749
                                      VaR confidence interval:  (-1618485.9785270628, -1518422.3265124052)
                                      CVaR confidence interval:  (-2080541.2334068448, -1967489.5991566714)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      t-Student Distribution with degree = 13.700000
                                      Value at Risk(VaR) 5%: -1581895.427196923
                                      Conditional Value at Risk(CVaR) 5%: -2011225.1083148227
                                      VaR confidence interval:  (-1615616.9897984166, -1529525.4380296783)
                                      CVaR confidence interval:  (-2059034.2108246756, -1958366.0772564956)
                                      num failures: 44
                                      Kupiec test p-value:  0.005082771859813193
                                      **********************************************************************************
                                      
                                      In [69]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      for i in ([8,8.5]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      
                                      t-Student Distribution with degree = 8.000000
                                      Value at Risk(VaR) 5%: -1611509.4613866855
                                      Conditional Value at Risk(CVaR) 5%: -2140570.448234189
                                      VaR confidence interval:  (-1652124.404719935, -1578369.0808849803)
                                      CVaR confidence interval:  (-2212343.386171017, -2074134.6289488154)
                                      num failures: 42
                                      Kupiec test p-value:  0.001991292838160743
                                      **********************************************************************************
                                      t-Student Distribution with degree = 8.500000
                                      Value at Risk(VaR) 5%: -1644995.1525186433
                                      Conditional Value at Risk(CVaR) 5%: -2198692.9267220516
                                      VaR confidence interval:  (-1698654.1651146528, -1601758.0688966163)
                                      CVaR confidence interval:  (-2301156.5177420117, -2098847.4766343506)
                                      num failures: 40
                                      Kupiec test p-value:  0.0007102343104543985
                                      **********************************************************************************
                                      
                                      We can see that we are not able to pass the Kupiec test using t-Student distribution with more market factors.

                                      GMM

                                      I am going to paste the outputs directly because it is taking a lot of time.
                                      In [ ]:
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      test_result=[]
                                      for num in [4,10,16,50,100]:
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          GMM_simulateTrialReturns(factorsReturns,num,
                                                              max(int(numTrials/parallelism), 1), 
                                                              bFactorWeights.value
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("GMM with %d components" %num)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          test_result.append(result)
                                          print("**********************************************************************************")
                                      

                                      Recomputing the above code will take a lot of time and every 6 hours the notebook is terminated so I'll post the results of the first 4 GMMs

                                      GMM with 4 components
                                      Value at Risk(VaR) 5%: -1253337.5423792657
                                      Conditional Value at Risk(CVaR) 5%: -1670735.285886857
                                      VaR confidence interval: (-1289391.8784495369, -1212514.3676251906)
                                      CVaR confidence interval: (-1725336.5121653462, -1625109.6935056315)
                                      num failures: 56
                                      Kupiec test p-value: 0.2539027201840111



                                      GMM with 10 components
                                      Value at Risk(VaR) 5%: -877306.538078644
                                      Conditional Value at Risk(CVaR) 5%: -1314281.926748762
                                      VaR confidence interval: (-912025.2671519847, -843758.861645978)
                                      CVaR confidence interval: (-1379594.7312052492, -1258677.2989874093)
                                      num failures: 88
                                      Kupiec test p-value: 0.004836626495626101



                                      GMM with 16 components
                                      Value at Risk(VaR) 5%: -875361.7038092854
                                      Conditional Value at Risk(CVaR) 5%: -1329417.7253791727
                                      VaR confidence interval: (-910755.6857189045, -837747.5754454148)
                                      CVaR confidence interval: (-1382147.7961706482, -1283941.8189907416)
                                      num failures: 88
                                      Kupiec test p-value: 0.004836626495626101



                                      GMM with 50 components
                                      Value at Risk(VaR) 5%: -890502.1606097086
                                      Conditional Value at Risk(CVaR) 5%: -1391443.5774341524
                                      VaR confidence interval: (-971934.4527563257, -833638.4779192186)
                                      CVaR confidence interval: (-1449848.9926762576, -1330670.9563429244)
                                      num failures: 88
                                      Kupiec test p-value: 0.004836626495626101


                                      We see that we get better results here and even one of the models passed the test.

                                      Let's add more non-linearity to the features

                                      In [83]:
                                      def featurize2(factorReturns):
                                          squaredReturns = [w**2 if w>=0 else -w**2 for w in factorReturns]
                                          squareRootedReturns = [np.sqrt(w) if w>=0 else -np.sqrt(-w) for w in factorReturns]
                                          expReturns=[np.exp(-w) if w>=0 else -np.exp(-w) for w in factorReturns]
                                          sinReturns=[math.sin(w) for w in factorReturns]
                                          cosReturns=[math.cos(w) for w in factorReturns]
                                          
                                          # concat new features
                                          return squaredReturns + squareRootedReturns + factorReturns + cosReturns + sinReturns + cosReturns
                                      
                                      # test our function
                                      
                                      Let's do the calculations.
                                      In [84]:
                                      # transpose factorsReturns
                                      factorMat = transpose(factorsReturns)
                                      
                                      # featurize each row of factorMat
                                      factorFeatures = list(map(featurize2,factorMat))
                                      
                                      # OLS require parameter is a numpy array
                                      factor_columns = np.array(factorFeatures)
                                      
                                      #add a constant - the intercept term for each instrument i.
                                      factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      #print(factor_columns[0])
                                      # estimate weights
                                      weights = [estimateParams(stockReturns,factor_columns) for stockReturns in stocksReturns]
                                      
                                      #print("weights:", weights)
                                      print("size of weights= ",np.array(weights).shape)
                                      print("size of stockReturns= ",np.array(stocksReturns).shape)
                                      print("size of factor_columns= ",np.array(factor_columns).shape)
                                      
                                      /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
                                        after removing the cwd from sys.path.
                                      
                                      size of weights=  (2059, 73)
                                      size of stockReturns=  (2059, 1295)
                                      size of factor_columns=  (1295, 73)
                                      
                                      In [85]:
                                      factorCov = np.cov(factorsReturns)
                                      factorMeans = [sum(w)/len(w) for w in factorsReturns]
                                      

                                      Normal Distribution

                                      In [231]:
                                      ### RUN SILMULATION
                                      def simulateTrialReturns(numTrials, factorMeans, factorCov, weights):
                                          trialReturns = []
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = list(np.random.multivariate_normal(factorMeans, factorCov))
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize2(trialFactorReturns)
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              trialFeatures=np.array(trialFeatures)
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = sum(np.dot(weights,trialFeatures))
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      
                                              
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      trials = seedRDD.flatMap(lambda idx: \
                                                      simulateTrialReturns(
                                                          max(int(numTrials/parallelism), 1), 
                                                          factorMeans, factorCov,
                                                          bFactorWeights.value
                                                      ))
                                      trials.cache()
                                      
                                      valueAtRisk = fivePercentVaR(trials)
                                      conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                      print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      
                                      Value at Risk(VaR) 5%: -2054248.0135375736
                                      Conditional Value at Risk(CVaR) 5%: -2569470.8491307055
                                      
                                      In [232]:
                                      varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                      cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                      print("VaR confidence interval: " , varConfidenceInterval)
                                      print("CVaR confidence interval: " , cvarConfidenceInterval)
                                      print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
                                      
                                      VaR confidence interval:  (-2096071.8828769408, -1984065.6091565068)
                                      CVaR confidence interval:  (-2629033.201952602, -2497902.6105502946)
                                      num failures: 31
                                      Kupiec test p-value:  1.8434639425521288e-06
                                      
                                      We see that with the full given stocks and with more market factors and adding some non-linearity to the features, we still don't pass the Kupiec test.

                                      For the following, I am going to post the code and some results that I managed to get. Computations here are taking too much time and I couldn't manage to finish them all because the notebook terminates every six hours.

                                      GMM

                                      In [ ]:
                                      def GMM_simulateTrialReturns(data,num_comp,numTrials,weights):
                                          trialReturns = []
                                          gmm=GaussianMixture(n_components=num_comp,covariance_type='tied',max_iter=500)#covariance type is 'tied' because elements are correlated
                                          data=np.array(data)
                                          model=gmm.fit(data.T)
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = model.sample(1)[0].tolist()[0]
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize2(list(trialFactorReturns))
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              trialFeatures=np.array(trialFeatures)
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = sum(np.dot(weights,trialFeatures).tolist())
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      
                                      
                                      
                                      
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      test_result=[]
                                      for num in [10,16]:
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          GMM_simulateTrialReturns(factorsReturns,num,
                                                              max(int(numTrials/parallelism), 1), 
                                                              bFactorWeights.value
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("GMM with %d components" %num)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          print("Kupiec test p-value: " , result)
                                          test_result.append(result)
                                          print("**********************************************************************************")
                                      

                                      Here is the result:
                                      GMM with 4 components
                                      Value at Risk(VaR) 5%: -1744384.911798203
                                      Conditional Value at Risk(CVaR) 5%: -2234447.196914374
                                      VaR confidence interval: (-1784709.5466326443, -1697832.5499408026)
                                      CVaR confidence interval: (-2282299.5271949237, -2156376.068742758)
                                      num failures: 35
                                      Kupiec test p-value: 3.4528356934382664e-05

                                      We still didn't pass the test.

                                      The following has been running for a very long time with no printed results. I'll just present the code.

                                      t-Student

                                      In [ ]:
                                      def Student_simulateTrialReturns(numTrials, factorMeans, factorCov, weights,degree):
                                          trialReturns = []
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = mul.multivariate_t_rvs(factorMeans,factorCov,df=degree,n=1).tolist()[0]
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize2(trialFactorReturns)
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              trialFeatures=np.array(trialFeatures)
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = sum(np.dot(weights,trialFeatures).tolist())
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      
                                      
                                      
                                      
                                      prev_result,count=0,0
                                      for i in ([2,3,5,8,10]):
                                      
                                          
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          Student_simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,i
                                                          ))
                                          trials.cache()
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                          
                                          
                                          print("t-Student Distribution with degree = %f" %i)
                                          print ("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print ("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          result=kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                          if(result==prev_result):  #if the result is equal to the result of the previous degree,then increment count
                                              count+=1
                                          else:
                                              count=0
                                              prev_result=result
                                          if(count==2): #if last 3 results were the same, then break
                                              break
                                          print("Kupiec test p-value: " , result)
                                          print("**********************************************************************************")
                                      

                                      Conclusions and Further Notes

                                      • We managed to pass the Kupiec test under certain conditions and distributions shown above.
                                      • It looks like we could do better if we could get increase the number of stocks.
                                      • Picking the market factors to use requires running various experiments and a bit of creativity
                                      • Although we used a linear model for predictions and we forced some non-linearity, we could have used non-linear functions.
                                      • Another approach we could have used was try other multivariate distributions,but following the Prof.'s advice we will stick to the normal distribution as much as possible.
                                      • A log-normal distribution doesn't fit well to stock returns which are positive and negative. Instead, a log normal distribution best fits the stock values themselves.
                                      • Another thing we could have done was use PCA to uncorrelate the factors and try to find the suitable univariate distribution to fit it,however,this approach,although used, diminished some information that is crucial for predicition.
                                      • We tried implementing Gaussian Processes and Geometric Brownian Motion but with no luck and no time. We left what we did at the end. We will return to them later after covering Gaussian Processes in ASI.

                                      6. Summary

                                      In this lecture, we studied the Monte Carlo Simulation method and its application to estimate financial risk. To apply it, first, we needed to define the relationship between market factors and the instruments' returns. In such step, you must define the model which maps the market factors' values to the instruments' values: in our use case, we used a linear regression function for building our model. Next, we also had to find the parameters of our model, which are the weights of the factors we considered. Then, we had to study the distribution of each market factor. A good way to do that is using Kernel density estimation to smooth the distribution and plot it. Depending on the shape of each figure, we had to guess the best fit distribution for each factor: in our use case, we used a very simple approach, and decided that our smoothed distributions all looked normal distributions.

                                      Then, the idea of Monte Carlo simulation was to generate many possible values for each factor and calculate the corresponding outcomes by a well-defined model in each trial. After many trials, we were able to calculate VaR from the sequences of outcome's values. When the number of trials is large enough, the VaR converges to reasonable values, that we could validate using well-known statistical hypothesis.

                                      Uncompleted Tests

                                      Brownian Motion

                                      Apparantly, we can model market factors with Brownian Motion.
                                      In [ ]:
                                      seed=5
                                      def Brownian(seed, N):
                                          
                                          np.random.seed(seed)                         
                                          dt = 1./N                                    # time step
                                          b = np.random.normal(0., 1., int(N))*np.sqrt(dt)  # brownian increments
                                          W = np.cumsum(b)                             # brownian path
                                          return W, b
                                      def GBM(So, mu, sigma, W, T, N):
                                          t = np.linspace(0.,1.,N+1)
                                          S = []
                                          S.append(So)
                                          for i in range(1,int(N+1)):
                                              drift = (mu - 0.5 * sigma**2) * t[i]
                                              diffusion = sigma * W[i-1]
                                              S_temp = So*np.exp(drift + diffusion)
                                              S.append(S_temp)
                                          return S, t
                                      
                                      In [ ]:
                                      mean=np.mean(factorsReturns[0])
                                      std=np.std(factorsReturns[0])
                                      S0=factorsReturns[0][0]
                                      N=1000000
                                      W=Brownian(seed,N)[0]
                                      T=len(factorsReturns[0])+50
                                      soln = GBM(S0, mean, std, W, T, N)[0]    # Exact solution
                                      t = GBM(S0, mean, std, W, T, N)[1]       # time increments for  pl
                                      
                                      In [ ]:
                                      plt.plot(t, soln)
                                      plt.ylabel('Stock Price, $')
                                      plt.title('Geometric Brownian Motion')
                                      

                                      Gausian Processes

                                      In [ ]:
                                      from sklearn import gaussian_process
                                      from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
                                      
                                      In [ ]:
                                      kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
                                      
                                      In [ ]:
                                      factorsReturns.shape
                                      
                                      In [ ]:
                                      stocks[0][0]
                                      
                                      In [ ]:
                                      len(stocksReturns[0])
                                      
                                      In [ ]:
                                      gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
                                      gp.fit(np.arange(len(stocksReturns[0])).reshape(-1,1),stocksReturns[0])
                                      
                                      In [ ]:
                                      gp.kernel_
                                      
                                      In [ ]:
                                      x_pred = np.linspace(1305, 1310).reshape(-1,1)
                                      y_pred, sigma = gp.predict(x_pred, return_std=True)
                                      plt.plot(y_pred)
                                      plt.show()
                                      
                                      In [ ]:
                                      import numpy as np
                                      from sklearn.gaussian_process import GaussianProcessRegressor
                                      from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
                                      
                                      In [ ]:
                                      # Instanciate a Gaussian Process model
                                      kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
                                      gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
                                      
                                      # Fit to data using Maximum Likelihood Estimation of the parameters
                                      gp.fit(np.arange(len(stocksReturns[0])).reshape(-1,1),stocksReturns[0])
                                      
                                      # Make the prediction on the meshed x-axis (ask for MSE as well)
                                      y_pred, sigma = gp.predict(stocksReturns[0], return_std=True)
                                      

                                      References